The Congruence of Projective Instruments in Personnel Assessment


Douglas M. More 


McMurry, Hamstra & Co., Chicago, Illinois 


The general problem of attaining reliable 
instruments, or techniques, for the assessment 
and placement of personnel is a considerable 
one. A still more difficult task is the estab- 
lishment of validity for such procedures. An 
enormous literature has addressed this effort 
in the area of “objective,” readily scored tests. 
However, for the past two decades there has 
been a burgeoning of studies on the utility of 
projective techniques to predict job function- 
ing of candidates. In general, the results of 
studies using predictions from projectives have 
been disappointing whenever an attempt has 
been made to go beyond intuitive clinical im- 
pressions toward quantified data amenable to 
statistical verification and comparison.¹ Only
a handful of studies have indicated definitely 
positive and practically utilizable results (10, 
12). And, in the bulk of those instances we 
are hard put to decide whether the results ob- 
tained are attributable to the techniques used 
or to the personal skills of the analysts. This 
last problem is apt to remain with us for some 
time in spite of efforts to assess the relative 
contributions of individuals and techniques. 

The purpose of this study is to contribute further to the use of projective psychological instruments in making assessments of personnel. We will report the congruence² of results between different instruments and the concordance³ among all instruments used in producing final ratings and rank orders of our subjects. A later paper will report validity of our procedures in predicting job performance.

¹ Disheartening results in the use of projective instruments are exemplified in such studies as: OSS Assessment Staff (9), the VA clinical psychology assessment project (3, 6), and the Study of Mate-Selection (14).

² This term is used here in preference to the more common phrase “interinstrument reliabilities,” because there is no assumption of parallelism between the tests involved (cf. Winch and More [15]).



Methods and Techniques of Obtaining Data

The subjects of this experiment were 63 pharmacists, fairly homogeneous as to age, education, and experience.⁴ We were fortunate, however, in obtaining a considerable range of personality qualifications. This came about because there is such a scarcity of registered pharmacists today to keep drug stores open for long evening and week-end hours that the company had hired at times men whose general worth as employees was in doubt.

Each man had administered to him the following projective instruments: a Patterned Interview (PI) (5), a Biographical Summary (BS) (12), a Sentence Completion Test (SC) (7, 13), and an abbreviated Thematic Apperception Test (TAT) (2, 8). One analyst saw only the transcript of the PI, a second the BS, and a third the SC and TAT. A fourth analyst also was involved in the first 31 cases, treating the Rosenzweig Picture Frustration Test (R) (11).⁵ We were forced to drop this from the test battery essentially for reasons of time and financing even though the research group agreed that the instrument had made a significant contribution to our understanding of the cases on which it had been used. Each analyst also had available information on the subjects’ ages, marital status, education, and number of children, if any. They also had the subjects’ objective test attainments on a test of general mental ability (Wonderlic), the SRA Non-Verbal Test, and the McMurry-Johnson Test of Number Relations. The projective analyses, therefore, cannot be considered entirely “blind,” but they were independent.

It is difficult to assess exactly how much the social and test data available to all analysts contributed to general agreement without performing an extremely laborious discriminant function analysis. However, the following points should be noted as limiting the scope of possible conclusions, or at least as influences confounded with results from the projective analyses. First, married men received significantly higher final ratings than single men (t = 11.560, df = 61, p << .001), while mean age of the two subgroups is approximately the same. Of course, it may be argued that the married group is more mature in social adjustment and more highly motivated to work. Second, the more intelligent men (Wonderlic scores) tended to be given higher final ratings (r = .41, N = 63, p < .001). On the other hand, factors of age, the presence of children for men in the married group, and scores on the Non-Verbal test and Number Relations test all had negligible relationships with final ratings of subjects.

Each analyst summarized the information available to him on each man under the following rubrics: (a) Energy, initiative, and work attitude; (b) Intelligence and creativity; (c) Attitudes toward superiors; (d) Attitudes toward subordinates; (e) Major strengths; (f) Major weaknesses; and (g) Promotability to managerial responsibilities. He then rated the subject as 1 (Superior), 2 (Well qualified), 3 (Marginal), or 4 (Poor). These ratings provide the basis for the between-instrument, chi-square comparisons given below in the section on Results.

The 63 cases were sent to the raters in five groups of differing N. These subgroup Ns were, chronologically, 19, 12, 9, 11, and 12. To facilitate calculation of interinstrument congruences, the analysts separately ranked each subgroup from one to N. It was our experience that the ranking of 19 cases represented a comparatively formidable task, but the smaller groups were more easily amenable to this treatment. This procedure forced the analysts to make far more sensitive discriminations among cases than was provided by the simple rating on a scale of four. Some groups contained no person who had been considered “superior,” a rating of 1, by any analyst; but the ranking always provided an ordering to which correlation methods could be applied.

³ Concordance is used here in its statistical sense (cf. Kendall [4, pp. 80-89]), as a measure of communality of judgment (extent of mutual agreement) among three or more judges.

⁴ The average age of subjects was 26. All had completed a college degree in Pharmacy and were licensed pharmacists and, on the average, had slightly over three years of drug store experience as pharmacists. None were younger than 21 years of age, only one was over 35. Somewhat over one third were single, and these were generally in the youngest half of the group.

⁵ The writer served as analyst of the SC and TAT. He is grateful to L. T. Dickson for analyses of PI records, to G. J. Spencer for analyses of BS protocols, and to Mary R. Holtzer for analyses of both PI and Rosenzweig records on different groups.


Table 1

Interinstrument Congruences and Concordance of All Rankings †

             N     A×B    A×C    A×D    B×C    B×D    C×D
Group I      19    .25    .22    7      .14    .35*   -.05
Group II     12    .28    .28    .36    Be     A2     .24
Group III     9    .33    Ad            .61*
Group IV     11    AS                   .38
Group V      12    .33    .12           so

Total        63

† A: Patterned Interview; B: Biographical Summary; C: Sentence Completion and TAT; D: Rosenzweig Picture Frustration Test. Interinstrument correlations are rank order τ (tau); concordance is W, based on 4 rankings for Groups I and II, and 3 rankings thereafter (cf. Kendall [4]).
* Significant at less than .05.
** Significant at less than .02.
*** Significant at less than .01.
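
The two rank statistics used for the subgroup rankings, Kendall's tau between pairs of analysts and the coefficient of concordance W among all of them, can be sketched in a few lines. This is a minimal illustration that assumes untied rankings; the function names and the five-subject rankings are hypothetical and are not the study's data.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau for two untied rankings of the same subjects."""
    n = len(rank_a)
    score = 0
    for i, j in combinations(range(n), 2):
        # +1 when both analysts order the pair the same way, -1 otherwise
        product = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        score += 1 if product > 0 else -1
    return score / (n * (n - 1) / 2)


def kendall_w(rankings):
    """Kendall's coefficient of concordance W for m untied rankings of n subjects."""
    m = len(rankings)
    n = len(rankings[0])
    rank_sums = [sum(ranking[i] for ranking in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((total - mean_sum) ** 2 for total in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


if __name__ == "__main__":
    # Hypothetical rankings of five subjects by three analysts (1 = best)
    pi = [1, 2, 3, 4, 5]
    bs = [2, 1, 3, 5, 4]
    sc = [1, 3, 2, 4, 5]
    print("tau PI x BS:", kendall_tau(pi, bs))
    print("W over three rankings:", round(kendall_w([pi, bs, sc]), 3))
```

W runs from 0 (no agreement) to 1 (identical rankings), which is why it can be significant for a group even when individual pairwise taus are not.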


Results 


Table 1 summarizes the interinstrument 
rank order correlations (tau) and the level of 
concordance (W) among raters for the five 
groups. It may be noted that many of the 
interinstrument correlations do not reach a 
statistically acceptable level of significance. 


Table 2 


Interinstrument Congruences Over All Cases 


            PI        BS        SC-TAT    R†
PI

BS          .42***

SC-TAT      .40***

R†          .36*      A


FR§ Peg —_ — 


* Significant at <.05. 

** Significant at <.02. 

*** Significant at <.01. 

† The r's in the column and row for the Rosenzweig test are based on N of 31; r's elsewhere are based on N of 63.

§ “Final” rank order is based on 3 rankings over all cases for r with PI, BS, and SC-TAT; but for 4 rankings between “Final” and R.







Table 3

                 Agree    Disagree    χ²
PI × BS                   19          10.09
PI × SC-TAT               27           1.73
PI × R                    10           3.78
BS × SC-TAT               24           4.29
BS × R                                 9.81
SC-TAT × R                16           0.00

However, the amount of mutual agreement 
among all raters is sufficient to produce sig- 
nificant concordances for all groups ranked. 
This finding encouraged us to form a unique 
ranking of subjects in each group from the 
sums of squares of the individual instrument 
rankings. This final ranking (FR) is con- 
sidered, in accord with the usual practice, the 
“best” ranking, and it was so used in the later 
validity study. 

The total interinstrument agreement over 
all cases was obtained by transforming the 
ordinal data to standard scores (1). These 
product-moment correlations are given in 
Table 2. It is clear from these relationships 
that a satisfactory level of agreement was 
reached between all pairs of instruments, with 
the exception of that between R and SC-TAT. 
The correlation of an instrument with the FR 
here is roughly equivalent to the correlation 
of a set of scores with the average of several 
sets of which it is a member, and in a sense is 
an estimate of the reliability of the FR (15). 
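
One way to picture the pooling step is to convert each subgroup's ranks to normal deviates and then correlate the pooled scores. The sketch below is an assumption-laden illustration: the plotting position (rank - 0.5)/N is a common convention and is not taken from the paper or its reference (1), and the rankings and function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean


def ranks_to_normal_deviates(ranks):
    """Map ranks 1..n (1 = best) to standard normal deviates."""
    n = len(ranks)
    std_normal = NormalDist()
    # (rank - 0.5) / n is an assumed plotting position, not taken from the paper;
    # the sign is flipped so that the best rank receives the highest score.
    return [-std_normal.inv_cdf((rank - 0.5) / n) for rank in ranks]


def pearson_r(x, y):
    """Product-moment correlation of two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def pooled_congruence(groups_a, groups_b):
    """Correlate two instruments over all cases after within-group transformation."""
    scores_a = [z for ranks in groups_a for z in ranks_to_normal_deviates(ranks)]
    scores_b = [z for ranks in groups_b for z in ranks_to_normal_deviates(ranks)]
    return pearson_r(scores_a, scores_b)


if __name__ == "__main__":
    # Hypothetical rankings of two subgroups (N = 3 and N = 4) by two analysts
    pi_groups = [[1, 2, 3], [1, 2, 3, 4]]
    bs_groups = [[2, 1, 3], [1, 3, 2, 4]]
    print(round(pooled_congruence(pi_groups, bs_groups), 2))
```

Transforming within each subgroup puts ranks from groups of different size on a common scale before they are pooled.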

Table 3 provides further information on 
agreement between instruments in terms of 
the ratings (1, 2, 3, or 4) assigned the sub- 
jects. Since no disagreement was so extreme 
that one rater assigned a 4 when another had 
assigned a 1 (i.e., the corner cells opposite 
the positive diagonal in a 4 X 4 table were 
empty), these data were collapsed to 2 x 2 
tables with dichotomized categories as “above 
average” and “below average.” Three of the 
six comparisons made along these lines may 
be seen in Table 3 to have reached a statisti- 
cally acceptable level of significance. 
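
The collapsed 2 x 2 comparison can be illustrated with the ordinary Pearson chi-square for a fourfold table. The sketch assumes no continuity correction, since the paper does not say whether one was applied, and the cell counts are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and every column needs at least one case")
    return n * (a * d - b * c) ** 2 / denominator


# Critical value of chi-square with 1 degree of freedom at the .05 level
CRITICAL_05_DF1 = 3.841

if __name__ == "__main__":
    # Hypothetical counts: rows are instrument X (above, below average),
    # columns are instrument Y (above, below average)
    above_above, above_below, below_above, below_below = 22, 9, 10, 22
    chi2 = chi_square_2x2(above_above, above_below, below_above, below_below)
    print(round(chi2, 2), chi2 > CRITICAL_05_DF1)
```

The diagonal cells are the cases on which the two instruments agree; the off-diagonal cells are the disagreements.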


Conclusions and Discussion 


1. A statistically acceptable level of concordance has been found between rankings on projective instruments of five groups of pharmacists, even though the levels of significance of individual instrument rank correlations on the five subgroups were acceptably high in only 5 of the 21 congruences reported (Table 1).

2. Transformation of ranks to normal de- 
viation scores permitted calculation of inter- 
instrument congruences over all cases. These 
r’s (product-moment) reached an acceptable 
level of significance in all instances except 
between the Rosenzweig Picture Frustration 
Test and the joint analysis of Sentence Com- 
pletion and TAT material (Table 2). 

3. Chi-square tests of agreement between 
instruments as to rating of subjects “above 
average” or “below average” indicate substan- 
tial agreement between three pairs of instru- 
ments, and an unacceptable level of agree- 
ment between three other pairs (Table 3). 
The BS emerges as the best single instrument 
in these comparisons. 

These conclusions represent somewhat heart- 
ening results, particularly when judgments 
from different instruments are combined. We 
still do not obtain “congruences” equivalent 
in magnitude to the “reliabilities” ordinarily 
sought for parallel forms of objectively score- 
able tests. However, this loss, it appears to 
the writer, is more than compensated for by 
the extraordinary wealth of clinical informa- 
tion available from which to construct “case 
reports” for the use of a company personnel 
department. 

One surprising finding is that the two 
“depth” projective techniques and the two 
“chronological history” techniques did not 
pair off with each other. Both the Patterned 
Interview and the Biographical Summary 
contain a chronological record of subjects’ 
lives. The Rosenzweig Picture Frustration, Sentence Completion, and Thematic Apperception Tests all contain “fantasy,” projected responses. In spite of this difference, the highest congruence levels reached are between BS and SC-TAT. This result may be accounted for in part by the fact that the analyst of the BS treated his instrument to a “depth” analysis of response contents following the practice of Spencer and Worthington (12).

The agreement found between analysts of
projective psychological instruments as to the 
qualifications of 63 pharmacists is sufficiently 
acceptable to proceed with a study of the va- 
lidity of our conclusions. It is important to 
note that these results have been obtained on 
a sample highly homogeneous as to education, 
age, and occupation. Almost certainly, even 
higher agreements would be obtained if we 
were trying to differentiate men who were 
heterogeneous as to these sample qualities. 
We can conclude that our analysis did distin- 
guish personality differences within a single 
occupation in such a way as to be probably 
quite diagnostic of work functioning. 


Received September 21, 1956. 
“Chance” Scores on the Strong Vocational Interest Blank for Men


Samuel B. Lyerly 


Washington, D. C. 


Strong, in his book on interest measurement 
(1) and in the manual for his Vocational In- 
terest Blank for Men (2) has presented esti- 
mates of the standard deviation of “chance 
scores” for most of the scales for which the 
blank may be scored. These estimates were 
computed from the data of 40 blanks which 
had been filled out by throwing dice to determine the item responses. The “chance zone,” which varies from one occupation to
another, is useful in assessing the significance 
of scale scores, and is particularly meaning- 
ful when a number of scores fall within those 
areas. In such a case the counselor may sus- 
pect immaturity of interest patterns or care- 
lessness (or facetiousness) in filling out the 
blank. Prepared profile forms with the 
“chance” areas overprinted are widely used. 

In deriving these chance ranges Strong, in 
effect, was seeking to estimate, by means of 
a random sample of 40, the standard devia- 
tion of the distribution of scores resulting 
from filling out the blank in all possible different ways. It is of course impossible to accomplish this latter, since there are $3^{400}$ ways to complete the blank without duplication.

Fortunately, it is possible to learn all that 
we wish to know about such a distribution 
without actually producing it and without re- 
sorting to sampling experiments. Since the 
interitem correlations are all zero when the 
blank is filled out in all possible ways, the 
variance of the chance distribution of raw 
scores for a given scale is simply the sum of 
the 400 chance item variances (i.e., the vari- 
ances of the three weights assigned to each 
item). For example, in the Lawyer key, the 
weights for the first several items are 


                Response

Item        L      I      D
1           1      0     -1
2          -2      1      1
3          -1      0      1
4           1      0      0
Table 1

Strong’s Estimated “Chance” Standard Deviations on the Vocational Interest Blank for Men and the Exact Values

                        Strong’s    Exact     Difference
Scale                   Estimate    Value     (Strong - Exact)

Artist                  3.6         3.33        .27
Psychologist (new)                  4.52
Architect               4.5         4.12        .38
Physician (new)                     5.06
Dentist                 5.2         5.10        .10
Mathematician           4.3         3.76        .54
Engineer                5.2         4.28        .92*
Chemist                 5.6         4.68        .92*
Production Mgr.         4.4         4.38        .02
Farmer                  3.5         3.93       -.43
Carpenter               4.4         4.88       -.48
Aviator                 5.7         4.32       1.38**
Math. Sci. Teacher      3.8         3.82       -.02
Policeman               3.5         3.78       -.28
Forest Service          5.0         5.26       -.26
Y. Phys. Dir.           4.1         4.31       -.21
Personnel Mgr.          5.4         5.04        .36
Public Admin.                       6.36
Y. Secretary            4.7         3.92        .78*
Soc. Sci. Teacher       4.4         4.30        .10
School Supt.            4.9         4.91       -.01
Minister                4.4         4.29        .11
Musician                4.9         5.31       -.41
Senior C.P.A.                       4.31
Accountant              5.1         5.17       -.07
Office Man              4.7         4.27        .43
Purchase Agent          4.5         4.74       -.24
Banker                  44          4.10        0
Sales Mgr.              4.0         4.16       -.16
Real Estate             3.4         3.26        .14
Life Insurance          3.4         3.56       -.16
Advertiser              3.4         3.40        .00
Lawyer                  3.6         3.72       -.12
Author-Journ.           2.4         2.56       -.16
Pres. Mfg. Concern      4.5         4.40        .10
Army Officer                        4.98
Physicist                           4.42
I. M.                   2.1         1.92        .18
M. F.                   3.4         3.23        .17
O. L.                   2.5         2.67       -.17

** F significant at 1% level.
* F significant at 5% level.

The variance of the weights for the first item is

$$\frac{1^2 + 0^2 + (-1)^2}{3} - \frac{(1 + 0 - 1)^2}{9} = \frac{2}{3},$$

and for the second,

$$\frac{(-2)^2 + 1^2 + 1^2}{3} - \frac{(-2 + 1 + 1)^2}{9} = 2.$$


The sum of these variances over all 400 items 
is the chance variance, in raw-score terms, of 
the Lawyer scale. Variances for the other 
scales are computed in the same manner, 
using the keys for those scales. (Actually, as 
an experienced computer will immediately 
notice, there are short cuts by means of which 
the final variance may be found without cal- 
culating each individual item variance.) The 
raw-score mean of such a distribution, as 
Strong has indicated, is one-third of the alge- 
braic sum of the weights. Other parameters 
can be found, but it is sufficient to note that 
the distribution would be almost perfectly 
normal. 
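
The computation described above is easy to sketch with exact fractions. The snippet below uses only the four Lawyer-key items quoted in the text; a full key would have 400 such triples. The function names are illustrative and the equal-probability assumption for the three responses follows the text's "all possible ways" argument.

```python
from fractions import Fraction


def item_variance(weights):
    """Variance of one item's weights when each response is equally likely."""
    n = len(weights)
    mean_of_squares = Fraction(sum(w * w for w in weights), n)
    square_of_mean = Fraction(sum(weights), n) ** 2
    return mean_of_squares - square_of_mean


def chance_variance(key):
    """Chance raw-score variance: the sum of the item variances (items uncorrelated)."""
    return sum(item_variance(weights) for weights in key)


def chance_mean(key):
    """Chance raw-score mean: one-third of the algebraic sum of all weights."""
    return sum(Fraction(sum(weights), len(weights)) for weights in key)


# (L, I, D) weights for the first four Lawyer items as given in the text
LAWYER_FIRST_FOUR = [(1, 0, -1), (-2, 1, 1), (-1, 0, 1), (1, 0, 0)]

if __name__ == "__main__":
    for weights in LAWYER_FIRST_FOUR:
        print(weights, item_variance(weights))
    print("variance over these items:", chance_variance(LAWYER_FIRST_FOUR))
    print("mean over these items:", chance_mean(LAWYER_FIRST_FOUR))
```

The square root of the summed variance gives the chance standard deviation in raw-score units; converting it to Strong's standard score system requires the scale's norms, which are not reproduced here.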




Table 1 lists the chance standard deviations 
(converted into Strong’s standard score sys- 
tem) for most of the scales as estimated by 
Strong and the exact values. Most of the 
differences are small, and the positive and 
negative errors of estimate are about equally 
divided. The greatest difference is in the 
Aviator scale, where Strong’s estimate is about 
1.4 standard score points too high. This 
difference is significant at the 1% level as 
judged by the F test. Three others (Engi- 
neer, Chemist, and Y. Secretary) are signifi- 
cant at the 5% level. 


Received June 20, 1956. 


References 


1. Strong, Edward K., Jr. Vocational interests of men and women. Stanford: Stanford Univer. Press, 1943.

2. Strong, Edward K., Jr. Manual for vocational interest blank for men. Stanford: Stanford Univer. Press, 1951.





Journal of Applied Psychology
Vol. 41, 1957


Stability Measures of Strong Vocational Interest Blank Profiles ¹


Leslie A. King 
The General College, University of Minnesota 


A number of studies have been made con- 
cerning the stability of Strong Vocational In- 
terest Blank (SVIB) profiles (8, 9, 10, 11, 
12). A specific problem in SVIB stability 
studies has been the development of an ade- 
quate measure of stability. This difficulty is 
basically a profile similarity problem, with 
each profile composed of 44 test scores. 

Considerable attention is currently focused 
on the general problem of profile similarity. 
Gaier and Lee (3) in 1953 reviewed several 
profile similarity measures and discussed some 
of the problems involved in handling profiles. 
In 1954 Helmstadter (4) made a survey of 
22 profile similarity measures and used data 
from 270 geometric solids for the profiles to 
be analyzed for similarity. The applicability 
of Helmstadter’s findings to SVIB profiles is 
unknown. Cottle (1) in 1954 pointed out that existing systems for handling profiles do not adequately show both similarity in profile shapes and differences in height between the profiles. Livson and Nichols concluded in 1955 that “. . . to our knowledge, the problem of the statistical evaluation of the difference between individual profiles, while assiduously besieged, remains unconquered” (7, p. 38).

The traditional measure of SVIB profile 
similarity has been the rank correlation tech- 
nique. There has been only one study of the 
meaningfulness of rho as a measure of SVIB 
stability. Data for this study has not yet 
been published, but Donald Hoyt reported at 
a conference honoring E. K. Strong, Jr., held 
at the University of Minnesota in 1955 (6) 
on a study he recently completed. Hoyt had 
counselors rate the amount of change in in- 
terpretation between test-retest profiles they 
would make in counseling with the SVIB. 


1 This paper is based upon a portion of a Ph.D. 
thesis submitted to the graduate faculty of the Uni- 
versity of Minnesota. The author wishes to ac- 
knowledge the guidance of his advisors, Dr. Willis 
E. Dugan and Dr. Cyril J. Hoyt. 


Hoyt concluded that the rank correlation co- 
efficient was a meaningful index of interest 
stability. However, research by the writer 
leads him to conclude that rho may not be 
the best measure of interest score stability 
from a counseling viewpoint. Counselors em- 
phasize group patterns as defined by Darley 
(2, pp. 76-77) and also give considerable at- 
tention to letter grade scores of the individual 
occupational scales. Rho is not based on 
changes in group patterns and letter grade 
scores. 

Powers in 1956 (8) analyzed test-retest 
SVIB profiles for Primary (P), Secondary 
(S), Tertiary (T), and Reject (R) patterns 
in each of the 11 interest groups using Dar- 
ley’s (2, pp. 76-77) method for the P, S, 
and T patterns and classifying all others as 
Rejects. She then tabulated the difference in 
patterns for each subject in each of the 11 
groups. For example, if a subject had a Primary pattern in Group I in the first test and
a Tertiary pattern in the second test, he was 
assigned a difference of two. These differ- 
ences, regardless of sign, were summed for 
each individual and called D. A chi-square 
test indicated that D was significantly related 
to rho. 
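
The D score described above can be sketched as a sum of absolute pattern shifts. The numeric coding of the four patterns (P = 0 through R = 3) is an assumption chosen so that a Primary-to-Tertiary shift counts as two, as in the example; the function name and the sample profiles are hypothetical.

```python
# Assumed ordinal coding of the four pattern categories
PATTERN_LEVEL = {"P": 0, "S": 1, "T": 2, "R": 3}


def d_score(first_test, second_test):
    """Sum of absolute pattern shifts over the interest groups of one subject."""
    if len(first_test) != len(second_test):
        raise ValueError("both profiles must cover the same interest groups")
    return sum(
        abs(PATTERN_LEVEL[a] - PATTERN_LEVEL[b])
        for a, b in zip(first_test, second_test)
    )


if __name__ == "__main__":
    # Hypothetical patterns for the 11 interest groups on test and retest
    fall = ["P", "R", "R", "S", "T", "R", "R", "R", "S", "R", "R"]
    spring = ["T", "R", "R", "S", "S", "R", "R", "T", "S", "R", "R"]
    print(d_score(fall, spring))
```

With 11 groups and a largest single shift of three (Primary to Reject), the largest possible D under this coding is 33.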

Two new objective measures of SVIB pro- 
file stability are described and illustrated in 
this paper. Their relationship to other sta- 
bility measures is also reported. 


Method 


Subjects. The subjects (Ss) of this study were all 
the male high school graduates who entered the Gen- 
eral College of the University of Minnesota in the 
fall quarter of 1954, as freshmen, took the SVIB 
during the orientation-registration program prior to 
the start of classes, and completed their third quar- 
ter in the General College in the spring quarter of 
1955. The number of Ss was 242. The students 
ranged in age from 16.5 to 27.5 years at the time of 
the fall administration of the SVIB with a median 
age of 19.0, a mean age of 20.0, and a standard 
deviation of 2.6. The range in high school percentile 
ranks was from 1 to 77 with a mean rank of 27.0 and a standard deviation of 16.4. The mean fall SVIB Interest-Maturity score was 48.8 with a standard deviation of 7.6. The mean General Aptitude Test Battery “G” score was 107.0 with a standard deviation of 11.3. Eighty-eight students (36.4%) were veterans and 31 (12.8%) were married.

It should be noted that the General College stu- 
dent population is a relatively unique junior college 
population. Most of the students entering the Gen- 
eral College have unrealistically high educational- 
vocational goals, and since only a small percentage 
eventually graduate from a senior college or profes- 
sional school, these students are subject to consider- 
able pressure from the General College staff to change 
educational-vocational goals and self concepts dur- 
ing the first year in college. 

Procedures. The Ss were retested on the SVIB 
during the latter part of the spring quarter of 1955. 
The interval between SVIB administrations averaged nine months. All SVIB’s were scored on 44 occupational scales.

Five stability measures were compared. The first 
was Powers’ method based on shifts in the 11 group 
patterns. These values were called D scores. 

The second stability measure was based on letter 
grade changes for each of the 44 occupational scales. 
For example, a value of one was assigned for a 
change of one category such as from B to B+ and 
a value of five for a change of five categories such 
as from A to C. The numerical values thus obtained 
for all 44 scales, regardless of sign, were summed for 
each individual and called L scores. 
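
A sketch of the L score follows. It assumes the six ordered letter grades C, C+, B-, B, B+, A, which is consistent with the text's examples (B to B+ is one category, A to C is five); the function name and sample grades are hypothetical.

```python
# Assumed ordering of the six letter-grade categories, lowest to highest
GRADES = ["C", "C+", "B-", "B", "B+", "A"]
GRADE_INDEX = {grade: position for position, grade in enumerate(GRADES)}


def l_score(first_test, second_test):
    """Sum of absolute letter-grade shifts over the occupational scales."""
    if len(first_test) != len(second_test):
        raise ValueError("both profiles must cover the same scales")
    return sum(
        abs(GRADE_INDEX[a] - GRADE_INDEX[b])
        for a, b in zip(first_test, second_test)
    )


if __name__ == "__main__":
    # Hypothetical grades on four scales at fall and spring testing
    fall = ["B", "A", "C+", "B-"]
    spring = ["B+", "C", "C+", "B"]
    print(l_score(fall, spring))
```

Under this coding a profile of 44 scales can shift by at most 44 x 5 = 220 categories.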

The third stability measure was based on both 
group pattern changes and letter grade shifts. The 
D and L scores were converted to standard scores 
and these scores were summed for each individual. These sums were called S scores.
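
The S score can be sketched as the sum of two standardized scores. The scaling to mean 50 and standard deviation 10 is an assumption: the paper does not state it, but it is consistent with the reported S mean of 99.8 and, given the reported D-L correlation of .76, with the reported S standard deviation of 18.8. The use of the population standard deviation and all names and data are likewise illustrative.

```python
from math import sqrt
from statistics import mean, pstdev


def to_standard_scores(values, center=50.0, spread=10.0):
    """Linear standard scores with the given mean and SD (assumed 50 and 10)."""
    m = mean(values)
    sd = pstdev(values)
    return [center + spread * (v - m) / sd for v in values]


def s_scores(d_scores, l_scores):
    """Sum of each subject's standardized D and L scores."""
    d_std = to_standard_scores(d_scores)
    l_std = to_standard_scores(l_scores)
    return [d + l for d, l in zip(d_std, l_std)]


def sd_of_sum(sd_x, sd_y, r):
    """SD of a sum of two variables with correlation r."""
    return sqrt(sd_x ** 2 + sd_y ** 2 + 2 * r * sd_x * sd_y)


if __name__ == "__main__":
    # Hypothetical D and L scores for three subjects
    print(s_scores([2, 5, 8], [20, 30, 40]))
    # Consistency check against the reported figures
    print(round(sd_of_sum(10, 10, 0.76), 1))
```

Because D and L are on very different raw scales, standardizing first gives the two components equal weight in S.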

The fourth measure was a rating of the extent of 
the interest changes between fall and spring SVIB’s 
by three experienced counselors. These three coun- 
selors have the Ph.D. degree in psychology or educa- 
tional psychology and have had a number of years 
experience using the SVIB as a counseling instru- 
ment. The investigator did not ask each counselor 
to rate all the profiles because of the large number 
of profiles. A random sample of 32 cases was se- 
lected for rating by all three counselors. These 32 
cases were used in testing the reliability of the rat- 
ings. From the remaining 210 cases, 70 were drawn 
at random for the first counselor, 70 for the second 
counselor, and the remaining 70 were given to the 
third counselor. The order of presentation of the 
102 pairs of profiles for each counselor was deter- 
mined in a random manner. Identifying data such 
as name, age, and scores on the nonoccupational 
scales were deleted from the profiles so that this in- 
formation could not influence the ratings. In order 
to approach actual counseling procedures with the 
SVIB, the profiles were marked so that the coun- 
selors knew which profiles resulted from the fall test- 
ing. The counselors were asked to rate the similarity 
of each pair of profiles by answering this question: 
“To what extent is there a change in the vocational interests of this student?” In answering this question, the counselors used the following five-point rating scale:

1. There is no change or only an insignificant one. 

2. There is a small change, but the basic interests 
are still very similar. 

3. Most of the interests are similar, but there is at 
least one major change. 

4. Some of the interests are similar, but there are 
at least two major changes. 

5. There is a great change of interests—the inter- 
ests are definitely more different than similar. 

The reliability of the counselors’ ratings for the 
32 cases rated by all three counselors was deter- 
mined by the method proposed by Hoyt and Stunk- 
ard. The method of probits developed by Bliss was 
used to obtain standard normal ratings for the en- 
suing statistical analysis. 

The fifth stability measure was the traditional one, 
i.e., the rank correlation coefficient. Rho was determined for 50 cases drawn at random. It should be
noted that a high rho value indicates relatively stable 
interests whereas high numerical values of D, L, S, 
and counselors’ ratings indicate the unstable end of 
the stability continuum. 
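
For reference, the rank correlation between a subject's fall and spring profiles can be sketched as below. The code assumes no tied scores within a profile (real profiles may need a tie correction), and the five-scale profiles and function names are hypothetical stand-ins for the 44 scale scores.

```python
def to_ranks(scores):
    """Rank scores from 1 (lowest) to n, assuming no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def spearman_rho(first_test, second_test):
    """Spearman rank correlation between two profiles over the same scales."""
    n = len(first_test)
    squared_diffs = sum(
        (a - b) ** 2
        for a, b in zip(to_ranks(first_test), to_ranks(second_test))
    )
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


if __name__ == "__main__":
    # Hypothetical standard scores on five scales at fall and spring testing
    fall = [50, 42, 31, 27, 19]
    spring = [44, 48, 30, 21, 25]
    print(spearman_rho(fall, spring))
```

Rho depends only on the ordering of scales within each profile, so a uniform rise or fall in all scores leaves it unchanged.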


Results 


Table 1
Distribution of D Scores for 242 General College Freshmen on 11 Occupational Groups of the Strong Vocational Interest Blank
(Mean = 5.3. SD = 2.9)

Distribution of stability scores. The distributions for D, L, and S scores for the 242 pairs of SVIB profiles are given in Tables 1,
2, and 3, respectively. Statistical significance 
tests could not be made because the theo- 
retical sampling distributions are unknown. 
However, an inspectional comparison of the 
results with the possible maximum and mini- 
mum scores gives an indication of the relative 
permanence of the profiles. The maximum L 
score for a pair of profiles, based on 44 occu- 
pational scales, is 220 and the minimum is 
zero. The actual range of L scores was from 
3 to 88 with a mean score of 32.8 and a 
standard deviation of 13.5. The largest pos- 
sible D score is 33 and the smallest is zero. 
The actual range was from zero to 15 with a
mean of 5.3 and a standard deviation of 2.9. 
It is evident that the L and D scores were not 
nearly as large as theoretically possible. 

The range of S scores was from 63 to 164 
with a mean of 99.8 and a standard deviation 
of 18.8. 

Table 2
Distribution of L Scores for 242 General College Freshmen on 44 Occupational Scales of the Strong Vocational Interest Blank
(Mean = 32.8. SD = 13.5)

Table 3
Distribution of S Scores for 242 General College Freshmen on the Strong Vocational Interest Blank
(Mean = 99.8. SD = 18.8)

Counselors’ stability ratings. The ratings made by each counselor for each of the 32 cases ranged from 1 to 5. Counselor A had a mean rating of 2.9; Counselor B, 2.3; and Counselor C, 2.1. The analysis of variance of the ratings indicated that the average change in interests was not judged to be the same by each counselor. The differences were significant at the .01 level.

For the 32 cases, there were seven cases (21.9%) of perfect agreement; there were 21 cases (65.6%) for which two counselors agreed; there were four cases (12.5%) for which all three counselors disagreed. The reliability coefficient for the average of the three counselors’ ratings was .86. This was high consistency for the 32 pairs of profiles and indicated that confidence could be placed in a single counselor’s ratings for the 210 cases rated by only one counselor. The reliability coefficient for a single counselor’s rating would be .67 by applying the Spearman-Brown formula for a test of one-third length.
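
The step from .86 to .67 can be checked with the general Spearman-Brown formula, r_k = k r / (1 + (k - 1) r), with k = 1/3 for a single rater out of three. The function name below is illustrative.

```python
def spearman_brown(reliability, length_factor):
    """Reliability after changing test length (or number of raters) by a factor k."""
    k = length_factor
    return k * reliability / (1 + (k - 1) * reliability)


if __name__ == "__main__":
    average_of_three = 0.86
    single_rater = spearman_brown(average_of_three, 1 / 3)
    print(round(single_rater, 2))
    # Stepping back up by a factor of three recovers the pooled reliability
    print(round(spearman_brown(single_rater, 3), 2))
```

The same function run with k = 3 shows how much pooling three raters improves on any one of them.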

Table 4
Product-Moment Correlations of D, L, and S Scores with Counselors’ Ratings and Rank Correlation Coefficients
(All coefficients are significant at the .01 level)

Stability Measure    Counselors’ Ratings    rho
D                    .55*                   -.70†
L                    .68*                   -.90†
S                    .64*

* Based on total sample of 242 cases.
† Based on a random sample of 50 cases.

The mean rating for the 102 pairs of profiles judged by Counselor A was 3.1; by Counselor B, 2.3; and by Counselor C, 2.0.

Rank correlations. The rank correlation 
coefficients for 50 cases ranged from .95 to 
.26 with a median rho of .85 and a mean rho 
of .80. There was only one case in which rho 
was not significantly different from zero at 
the .001 level. The distribution of the rho
values was negatively skewed. 

Interrelationships of the stability measures. 
Product-moment correlations of D, L, and S 
scores with counselors’ ratings and with rho 
values are given in Table 4. All of the co- 
efficients are significant at the .01 level. The 
correlations between D, L, S scores and coun- 
selors’ ratings are based on the total sample 
of 242 cases while the correlations between D, 
L, S scores and the rank correlation coeffi- 
cients are based on a random sample of 50 
cases. The product-moment correlation co- 
efficient between D and L scores based on the 
total sample was .76 which is also significant 
at the .01 level.

The product-moment correlation coefficient 
between the rho values and counselors’ rat- 
ings for a random sample of 50 cases was 
— .54 which is significant at the .01 level. 
The correlation between the S scores and the 
counselors’ ratings was .60 for this sample of 
50 cases. (The correlation between the S 
scores and the counselors’ ratings for the total 
sample was .64.) 

A test of the significance of the difference between correlation coefficients failed to reject the hypothesis that counselors’ ratings
were correlated equally high with the rho 
values (— .54) and with the S scores (.60). 


The difference in correlations between the D 
scores and rho’s, — .70, and L scores and 
rho’s, — .90, was significant at the .01 level. 

The difference in correlations between D 
scores and counselors’ ratings, .55, and L 
scores and counselors’ ratings, .68, for the to- 
tal sample of 242 cases was significant at the 
.O1 level. 
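
The report does not say which test of the difference between correlations was used. As one possibility only, the sketch below applies Hotelling's t for two correlations that share a variable within a single sample, using the three coefficients given in the text for the total sample (D with ratings .55, L with ratings .68, D with L .76). The function name and argument names are illustrative assumptions.

```python
import math

def hotelling_t(r_xz, r_yz, r_xy, n):
    """Hotelling's t for two correlations sharing variable z in one sample.

    r_xz, r_yz: correlations of x and of y with the shared variable z
    r_xy: correlation between x and y
    Returns (t, degrees of freedom), with df = n - 3.
    """
    # Determinant of the 3 x 3 correlation matrix of x, y, and z.
    det = 1 - r_xz**2 - r_yz**2 - r_xy**2 + 2 * r_xz * r_yz * r_xy
    t = (r_yz - r_xz) * math.sqrt((n - 3) * (1 + r_xy) / (2 * det))
    return t, n - 3

# x = D scores, y = L scores, z = counselors' ratings; n = 242 cases.
t, df = hotelling_t(r_xz=0.55, r_yz=0.68, r_xy=0.76, n=242)
print(f"t = {t:.2f} on {df} df")
```

Under these assumptions t comes out close to 4 on 239 degrees of freedom, well beyond the two-tailed .01 critical value of roughly 2.6, which agrees with the significance reported for this comparison.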


Discussion 


There is a logical expectation of a positive 
correlation between D and L scores since by 
definition there could be no change in an 
SVIB group pattern category without a shift 
of at least one occupational scale letter grade. 
There could be letter grade changes, of course, 
without a change in the group pattern cate- 
gory. The high correlations between the 
writer’s two new objective profile stability 
measures, L and S, and the traditional
measure, rho, were unexpected since rho is not
based directly on shifts in letter grades or 
group patterns. 

Higher reliability for the counselors’ rat- 
ings could have been attained, probably, if 
the writer had defined the term “change in 
vocational interests” and set forth criteria for 
determining what constituted an insignificant 
change, etc., and then instructed the coun- 
selors to rate the profiles accordingly. Since 
the writer did not so structure the instruc- 
tions, the counselors were free to use their 
counseling experience and clinical intuition as 
the basis for their ratings. However, the ob- 
tained reliability was satisfactory for the pur- 
pose of this study. The ratings indicated 
that the counselors considered the average in- 
terest changes to be relatively minor in ex- 
tent. Some students, however, showed major 
changes in their interests. 

The two new objective stability measures, 
L and S, as well as the other two measures,
rho and D, were only moderately related to 
counselors’ ratings of interest changes. If 
the counselors’ ratings are taken as the va- 
lidity criterion, all four objective measures 
leave something to be desired as measures of 
interest permanence. The question of whether 
or not the counselors’ ratings are valid meas- 
ures cannot be directly answered. It seems 
reasonable to believe, however, that
experienced counselors have competence to judge
the extent of interest changes from a study
of test-retest profiles.

Both of the two new measures, L and S, 
and especially L, have the advantage of be- 
ing more easily calculated than rho. D is 
also easier to compute than rho. A statisti- 
cal test of significance exists for rho, but for 
many SVIB studies, and most counseling pur- 
poses, this theoretical advantage may be of 
little or no importance. Although not sta- 
tistically significant, there is an indication 
from this study that L and S are more highly
related to the validity criterion than rho or 
D. Since rho is apparently not a more mean- 
ingful and valid measure than the other three 
objective stability measures, its continued use 
as an SVIB profile stability measure is of 
questionable advantage. The writer recom- 
mends, therefore, the use of either L or S as 
SVIB profile stability measures. The
distribution of scores for D, L, and S given in
Tables 1, 2, and 3 may be used as tentative 
normative data for college freshmen retested 
after an interval of one academic year. 


Summary 


A sample of 242 college freshmen who had 
completed the SVIB during early fall of 1954 
were retested at the end of one academic year. 
This study described two new objective SVIB 
profile stability measures, investigated their 
relationship to three other stability measures, 
and presented normative data for the meas- 
ures. The two new measures were L scores, 
based on letter grade changes, and S scores,
based on both letter grade and group pattern
shifts. The three other profile stability
measures were: (a) Powers’ D-score method, (b)
rank correlation coefficients, and (c) ratings 
of the extent of interest changes by experi- 
enced counselors. The counselors’ ratings 
were taken as the validity criterion. 

The stability measures were all significantly 
intercorrelated. The two new measures were 
highly correlated with the rank correlation 
coefficients and were moderately correlated 
with counselors’ ratings. All four
objective measures were significantly correlated
with the validity criterion, the coefficients
ranging from .55 to .68. The results showed that the new
measures were valid measures of SVIB pro- 
file stability. There was an indication that 
the new measures are more closely related to 
counselors’ ratings of the extent of interest 
changes than are rho and Powers’ D. The 
writer recommended the use of either one or 
both of the new profile measures on the basis 
of ease of computation and demonstrated 
validity. 


Received September 6, 1956.


The Effect of the Anticipatory Startle Pattern on
Aiming a Rifle ¹


H. C. W. Stockbridge 
Ministry of Supply, U. K. 


The object of the experiment was to in- 
vestigate the effect of the anticipatory startle 
pattern on aiming a rifle using a photographic 
technique. 

Method 
Apparatus 


The apparatus is fully described by Spring (6). 
A mirror was mounted on the muzzle of a rifle be- 
low and at a right angle to the line of flight of the 
bullet. This rifle was aimed over an optical system 
which projected a vertical and a horizontal line of 
light on to the mirror. These two lines of light were 
reflected into the apparatus where they were re- 
corded photographically, giving a measure of move- 
ment in traverse and elevation. Use of this appa- 
ratus was not found to hinder marksmanship (8). 


Design and Procedure 


Three observers were used. To obtain an estimate 
of his personal bias each observer aimed the rifle 21 
times using a vice as a rest. Photographic records 
of these aims were taken and standard deviations 
calculated. 

Before any subject shot, one of these three ob- 
servers laid the rifle on the center of the target, using 
a rest, and recorded what he thought was the true 
point of aim. The end points of the subject’s photo- 
graphic records could then be measured with respect 
to these center lines made by the observers, and mean 
errors calculated. Records could also be measured 
with respect to the timing marks on the film and 
standard deviations calculated. When live rounds 
were fired, three shots only were allowed for each 
target. To reduce error due to the recoil shifting
the mirror on the rifle, observers recorded center 
lines after every three live shots. Error could in 
this instance be calculated either from the targets or 
the photographic record. 

Nine subjects fired eight live rounds, eight blank 
rounds, and eight times with an empty firing cham- 
ber. These three conditions were randomized in a 
3 × 3 Latin square, using three replicates.

Subjects were made as comfortable as possible, and
every effort was made to allow them to take up
their normal firing position. At 25 yards many
subjects could see their own shot holes; knowledge of
results was also given verbally.


1 A fuller account may be found in the original
M.A. thesis, filed in the University of Reading
Library, England. The author wishes to thank Prof.
R. C. Oldfield for his guidance, Mr. J. Draper for
the statistical analysis and Mr. K. H. Spring for the
apparatus. The British Crown Copyright of this
paper is reserved. It is published with the
permission of H. B. M. Stationery Office.


Results 


No significant differences were found either 
in means or standard deviations for eleva- 
tion or traverse between live, blank, and no 
rounds; Table 1 gives these means and stand- 
ard deviations. 

Shot holes in targets were measured to 0.5 
millimeters by inserting the base of a .303 
bullet in each puncture. This bullet had its 
center punched into it, and the distance of 
that center from either axis could be read off 
a steel rule with the aid of dividers. 

Photographic records were measured with 
a traveling microscope accurate to 0.02 milli- 
meters. For each shot, then, a record was 
fixed under the microscope and a measure- 
ment of the position of the timing marks, 
elevation, and traverse traces made. By sub- 
tracting the first measurement from either of 
the others, the distance of the traverse or 
elevation traces from the timing marks was 
obtained. The position of the timing marks 
was fixed with respect to the base plate of the 
apparatus. These figures alone were
sufficient for the calculation of standard
deviations. To obtain mean deviations, the
observers’ estimates of points of aim were
calculated in the same way and these, in turn,
subtracted from the subjects’ readings.


Table 1

Means of Means of Traverse and Elevation Errors
(Minutes of arc. Corrected for observer bias).
Means of Log10 SD’s of Traverse and
Elevation Readings

                         Record            No
Condition                (live)   Blank    shot     S.E.

Traverse mean             0.49    -1.16    -2.83    1.935
Elevation mean           -4.17    -3.76    -4.01    1.619
Log10 traverse mean       0.55     0.63     0.76    0.077
Log10 elevation mean      0.61     0.52     0.70    0.057

Note. The S.E. is of a mean obtained from analysis of
variance. It is identical for each of the three conditions. The
S.E. of a difference between any two condition means is easily
calculable.

To make the distribution of the data a
closer approximation to normality and so to
give greater justification for the use of
analysis of variance, a log10 transformation was
used. It was found that the difference
between standard deviation in elevation and
traverse was not significant when these sets
of readings were combined in an analysis of
variance, photographic records being used for
live rounds.
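
A minimal sketch of the transformation step, assuming that each subject's standard deviation was computed per condition and then put on a log10 scale before the analysis of variance. The readings below are invented, and the function and variable names are illustrative only.

```python
import math
from statistics import stdev

def log10_sd(readings):
    """Sample standard deviation of the readings, expressed on a log10 scale."""
    return math.log10(stdev(readings))

# Hypothetical traverse readings (minutes of arc) for one subject, eight aims each.
readings_by_condition = {
    "live": [2.1, -3.4, 0.8, 5.0, -1.2, 3.3, -4.6, 0.5],
    "blank": [1.0, -6.2, 4.4, 7.1, -2.9, 3.8, -5.5, 0.2],
    "no shot": [3.6, -7.9, 5.2, 8.4, -4.1, 6.0, -6.8, 1.1],
}

log_sds = {cond: log10_sd(vals) for cond, vals in readings_by_condition.items()}
for cond, value in log_sds.items():
    print(f"{cond}: log10 SD = {value:.2f}")
```

Standard deviations cannot fall below zero and tend to have a long upper tail, so working with their logarithms is a common way of bringing such values closer to the normal form that analysis of variance assumes.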

An analysis of variance of mean errors from 
the center of the target was undertaken. 
Again it was possible to combine elevation 
and traverse readings. The “between men” 


effect was highly significant (p < .01) while 
the “between traverse and elevation” effect 
was significant (p < .05). 


Discussion 


Landis and Hunt (3) offer evidence on the 
startle pattern in marksmanship. They
suggest that shoulder movement occurs in a
startle pattern 100 to 150 milliseconds after 
the sound of a shot, and that the latent time 
for response of the arm is 125 to 195 milli- 
seconds and of the hand 145 to 195 milli- 
seconds. The bullet might still be in the 
barrel as long as 1.5 milliseconds after the 
cartridge had been fired. This estimate, 
though approximate, is very much less than 
the latent time of the startle response, the 
minimum time for the shoulder as given above 
being 100 milliseconds. It is thus unlikely 
that the startle pattern affects the flight of 
the bullet, but this, of course, does not pre- 
clude conditioning which might cause the sub- 
ject to jump before the rifle fired. These au- 
thors found that conditioning appeared only 
in subjects who gave a strong response to the 
first shot and who did not enjoy the experi- 
ence. 

Landis and Hunt (3) also investigated the 
response in 11 trained New York policemen 
firing .38-in. revolvers. They found that
“. . . the lid reflex never disappears through
habituation, and usually head movement is
found with it.” Using a camera running at
1,500 exposures a second, as was the case
with the latent times given above, 45
responses from 11 subjects gave a reaction time
for blink of 20 to 54 milliseconds with an av- 
erage of 40 milliseconds and a standard de- 
viation of 7 milliseconds. It is thus unlikely 
that visual feedback is cut off before the bul- 
let has left the revolver; jump may, of course, 
have begun in the barrel. The policemen did 
not become habituated to the shot in a series 
of shots, nor did the more experienced of 
them blink less. Thus here also no connec- 
tion was found between startle pattern and 
marksmanship. 

Hick (2), Gates (1), Stevens (7), and
McGuigan and MacCaslin (4) consider tremor
and marksmanship while Saul and Hirsch 
(5) give a useful review of 30 papers on 
more general topics. 


Summary 


Photographic apparatus previously found 
not to affect marksmanship was used to in- 
vestigate the anticipatory startle pattern. 
Records were taken of nine men firing eight 
live rounds, eight times with an empty firing 
chamber and eight blank rounds. The data 
were statistically analyzed. The results of 
this experiment suggest that the anticipatory 
startle pattern did not seriously affect the 
marksmanship of the subjects used. 


Received July 2, 1956.


Exposure Time as a Variable in Dial Reading Experiments ¹





David R. Thomas 


Duke University 


In 1948, Sleight (5) published a study of 
the influence of dial shape on legibility. Pre- 
senting the dials tachistoscopically with a 
.12-sec. exposure time, he found that, in order 
of decreasing accuracy, the dial shapes which 
he compared were ranked as follows: 1—open- 
window; 2—round; 3—semi-circular; 4— 
horizontal; and 5—vertical. The exposure 
time of .12 sec. was determined on the basis 
of a preliminary study. Five Ss were pre- 
sented with the five dial shapes at .28 sec., .20 
sec., .17 sec., .14 sec., and .12 sec. and no 
significant differences attributable to exposure 
time were found. The .12-sec. exposure time 
was selected for use in later research because 
this condition produced the greatest differ- 
ences between the dial types. 

In a subsequent article, Grether (3) criti- 
cized the tachistoscopic procedure in general, 
and Sleight’s use of this technique in particu- 
lar. He pointed out that tachistoscopic con- 
trol of exposure time does not constitute con- 
trol of response time, but serves rather to 
restrict the number of visual fixations by the 
subject on the displayed material. The actual 
response may be delayed for several seconds 
during which the S maintains a “mental 
image” of the indicator scale and pointer. 
Grether suggested that in Sleight’s experi- 
ment the use of a controlled exposure time 
which did not permit a change in the prepara- 
tory eye fixation thereby favored the open- 
window dial, because the position of its 
pointer was fixed and could always be
anticipated by S. In support of his criticisms of
Sleight’s procedure, Grether (3) reported that
when a test booklet (a pencil-and-paper test)
rather than the tachistoscopic method was
used, the fixed pointer indicators showed no
general superiority over comparable moving
pointer indicators.


1 This paper is based on a thesis submitted by the
author to the faculty of Brooklyn College for the
degree of Master of Arts in Psychology, June, 1956.
The author wishes to thank his thesis advisor, Dr.
R. A. Harris, and also to express his appreciation to
Drs. K. V. Wilson and G. A. Kimble of Duke
University for their help with the preparation of this
report.

Investigators who prefer the more con- 
trolled conditions of tachistoscopic stimulus 
presentation to the test booklet method might 
also question Sleight’s choice of exposure 
time. The range of times tested in his pre- 
liminary study was small (from .12 sec. to 
.28 sec.), and the size of his sample was small 
(N = 5). It may be that the use of more 
extreme exposure times would have produced 
significantly different results. Christensen (1) 
has reported exposure time to be an exceed- 
ingly important variable in his experiments 
on dial design. He found that relationships 
discovered at one exposure time might be in- 
significant or even reversed at other exposure 
times. 

The contradictory findings of Sleight and 
Grether suggest the importance of investigat- 
ing the legibility of various dials under con- 
ditions where eye fixations are controlled. 
Such a study might also concern itself with a 
related though separate problem—the effect 
of differences in exposure time on dial legi- 
bility. To these ends, an experiment was 
conducted in which miniature dials closely 
resembling those used by Sleight were used. 
It was assumed that, with miniature dials, 
differences in the number of eye fixations re- 
quired for the different dial shapes would no 
longer be an important factor. Superiority of 
the open-window dial over the moving pointer 
indicators under these adjusted conditions 
would raise serious questions concerning the 
“eye-fixation hypothesis.” 


Method 


Subjects. Ss were eighty college students, 63 men 
and 17 women, who served in the experiment on a 
voluntary basis. All had approximately 20-20 vi- 
sion, natural or corrected, as determined by the 
Snellen eye chart. 

Apparatus. Our aim in this study was not to 
duplicate Sleight’s dials but rather to use them as 
models for others having similar general
characteristics though differing in minor particulars. The same
five dial types employed by Sleight were used in the
present experiment. All the dials were identical in 
size and form of numerals, size of the graduations, 
distance between graduations, and dimensions of the 
pointer. Whereas in Sleight’s experiment the round 
dial was 24 in. in diameter and the others were in 
proportion, in the present study the round dial had 
a diameter of 4 in. with the others proportionately 
drawn. Our dials also differed from Sleight’s in scale 
length—his ranged from 0 to 10, ours from 0 to 6. 

Procedure. The eighty Ss were divided into four 
groups of twenty Ss each. Group I viewed all the 
dials at .50 sec. exposure time, Group II at .10 sec., 
Group III at .04 sec. and Group IV at .02 sec.
After appropriate instructions, each S viewed nine 
settings for each of the five dial types at the proper 
exposure speed for his group. The Ss were tested on 
all nine settings of one dial type before progressing 
to the next type. The order of presentation of the 
five dial types, of the nine settings within each set, 
and the composition of the four groups of Ss were 
all determined randomly. 

Results. Figure 1 presents the results of the ex- 
periment in the form of five “Deltagraphs,” one for 
each of the four exposure time conditions, and one 
for the data across pooled exposure times. Viewing 
the pooled data for all exposure times it is seen that 
the horizontal dial ranks first in accuracy, the round 
dial second, the vertical dial third, the open-window 
dial fourth, and the semi-circular dial fifth. The 
data for the different exposure times considered 
separately show the relationship among four of the 
dials to be stable, with the horizontal dial read most 
accurately, the round dial next, then the vertical dial, 
and the semi-circular dial read poorest of the four.
Only the open-window dial changes its relative rank 
with changes in exposure time. This dial is read 
second best at the .50-sec. and .10-sec. exposure
times, but fifth best at the .04-sec. and .02-sec. ex- 
posures. 

These data were submitted to an analysis of
variance (Lindquist Type I) (4). For the variation
between exposure times, F = 29.40 (df = 3, 76). For
that between dial shapes, F = 37.49 (df = 4, 304).
For the interaction between dial shape and exposure
time, F = 5.01 (df = 4, 304). All three of these F
ratios are highly significant (p < .01). Thus, under
the conditions of the present experiment, both the
dial shape and the exposure time employed are
important determiners of the accuracy of dial reading
performance.

Fig. 1. Dial legibility as a function of exposure time.
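
The degrees of freedom for a design of this kind can be sketched as follows, assuming the usual layout for one between-subjects factor (four exposure-time groups of twenty Ss) and one repeated factor (five dial shapes). The function name is illustrative. For exposure time the sketch gives 3 and 76, and for dial shape 4 and 304. For the interaction the conventional formula gives a numerator of 12, where the text reports 4; the sketch does not try to settle which figure the original analysis used.

```python
def mixed_design_df(n_groups, n_per_group, n_repeated_levels):
    """Degrees of freedom for one between-subjects and one within-subjects factor."""
    n_subjects = n_groups * n_per_group
    between = n_groups - 1                  # exposure-time groups
    error_between = n_subjects - n_groups   # subjects within groups
    within = n_repeated_levels - 1          # dial shapes
    interaction = between * within
    error_within = error_between * within   # shapes x subjects within groups
    return {
        "exposure time": (between, error_between),
        "dial shape": (within, error_within),
        "dial shape x exposure time": (interaction, error_within),
    }

df = mixed_design_df(n_groups=4, n_per_group=20, n_repeated_levels=5)
for effect, (numerator, denominator) in df.items():
    print(f"{effect}: df = {numerator}, {denominator}")
```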

For the purpose of evaluating the comparisons
between the different dial shapes at the different
exposure times a series of t tests was computed, using
the mean square for interaction as a basis for the
error term. The results of this analysis are also
presented in Fig. 1. The critical differences
significant at the .05 and the .01 levels are both
reported with the Deltagraphs.

The second Deltagraph in Fig. 1 allows us to
evaluate Grether’s “eye-fixation hypothesis.” At the
.10-sec. exposure time (closest to the .12 sec. used by
Sleight), the open-window dial is not the most
accurately read dial but rather ranks second behind the
horizontal dial. Thus, the results with small dials,
where the differences due to eye fixations are
minimized or eliminated, do not correspond to the
results reported with larger dials, and Grether’s
“eye-fixation hypothesis” appears substantiated.

Table 1 shows the distribution of errors over the
different scale points tested with each dial. A strong
tendency for fewer errors to be made at certain
“reference points” on some of the dials is revealed. By
“reference points” we mean positions on the scale
which are most easily recognized, e.g., the 12, 3, 6,
and 9 o’clock positions on most round dials, the 9,
12, and 3 o’clock positions on semi-circular dials, and
the two end points and the center point on the
horizontal and vertical scales. Since no point on the
open-window dial is any easier to recognize than any
other point, it might be predicted that the
distribution of errors on this dial would be rather evenly
spread out. Table 1 shows that this prediction is
indeed verified; the open-window dial shows the
most nearly regular error pattern with little variation
among pointer settings.


Table 1

The Distribution of Errors on the Five Dial Types

Pointer Settings

0 0.5 1 2.5

Horizontal 6 26 14 : 6
Vertical x 35 % 6:2! : 32
Round 25 x 31 : ) x
Semi-Circular x 43 52 j 12
Open-Window 22 37 27 26

5.5 Total

& 13 101
14 x 226
17 19 184
43 37 x 268
25 31 23 242

Nine pointer settings were used with each dial.
Where a given pointer setting was not used for a
particular dial, an x appears in the appropriate column.
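
The Total column of Table 1 can be checked against the pooled ranking given in the Results. The short sketch below sorts the five printed totals from fewest to most errors; the dictionary and variable names are illustrative, and only the totals themselves come from the table.

```python
# Total errors per dial type, as printed in the Total column of Table 1.
total_errors = {
    "horizontal": 101,
    "vertical": 226,
    "round": 184,
    "semi-circular": 268,
    "open-window": 242,
}

# Fewest errors first, i.e. the most accurately read dial first.
ranking = sorted(total_errors, key=total_errors.get)
for rank, dial in enumerate(ranking, start=1):
    print(rank, dial, total_errors[dial])
```

The resulting order (horizontal, round, vertical, open-window, semi-circular) matches the ranking reported for the data pooled over exposure times.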


Discussion 

More important than the finding that both 
dial shape and exposure time are significant 
variables in the dial reading situation, is the 
disclosure of a significant interaction between 
these two variables. Thus, a major problem 
connected with the tachistoscopic method of 
stimulus presentation is revealed. At what 
level should the exposure time be fixed? 
Whatever choice is made, the results of the 
experiment will be a function of the particular 
exposure time employed. One alternative 
method suggested has been the use of printed 
examination booklets. The validity of this 
procedure is difficult to assess. Offhand, dial 
reading would appear to be a speed rather 
than a power function; thus, one would ex- 
pect the appropriate test to be a speed test. 
Yet the tachistoscopic method involves the
pre-selection of an exposure time which then,
to a considerable extent, determines the
nature of the experimental findings. A number
of recent investigators (2, 6) have demon- 
strated the validity of printed test results in 
a variety of applied dynamic situations. 
These researchers offer strong support for 
the adoption of the test booklet method. 
The present experiment, by raising serious 
questions about the validity and/or general- 
izability of findings obtained by the tachis- 
toscopic procedure, may be considered as of- 
fering support for their position—by default. 


Summary 


An experiment was performed to compare
the relative legibility of horizontal, vertical,
round, semi-circular, and open-window dials
at four different exposure times, .50 sec., .10
sec., .04 sec., and .02 sec. The legibility
ranking of the five dials was found to vary
with the exposure speed, due to the
unreliability with which the Ss were able to read
the open-window dial. It ranked second best 
at slow exposures (.50 sec. and .10 sec.), but 
fifth best at rapid exposures (.04 sec. and .02 
sec.). The other four dials retained their 
relative positions with the ranking in terms 
of accuracy as follows: 1—horizontal; 2— 
round; 3—vertical; 4—semi-circular. 

The interaction between dial shape and ex- 
posure time raises some question of the va- 
lidity of the tachistoscopic method of stimu- 
lus presentation in studies of this type. It 
has been shown that the particular exposure 
time used may bias the results of such an ex- 
periment and thus make valid inference to 
other situations impossible. 


Received August 3, 1956.


Effect of Design on Accuracy and Speed of Operating Dials ¹


Roger J. Weldon and George M. Peterson 


University of New Mexico 


Equipment is sometimes controlled directly;
that is, the action of the equipment is ob- 
served and the controls adjusted to bring the 
action to a point desired by the operator, as 
in steering a car. In such cases, corrections 
are made as errors are discovered during op- 
eration. In other cases, equipment is con- 
trolled by presetting information into it, as in 
setting an alarm clock to go off at 6 A.M. 
Frequently, when information is set in this 
manner, there is a point in time after which 
an error cannot be corrected; an example 
would be the case where a computer is pre- 
pared completely before computations begin 
and cannot be altered while the process is 
going on. 

Some preset operations are sufficiently im- 
portant so that it is of some consequence to 
prevent errors from being set into the equip- 
ment. This practical consideration led to the 
study reported here, in which three types of 
dials were investigated to determine the com- 
parative accuracy with which information is 
set into them. The two criteria of merit used 
were the number of errors made in setting and 
checking the dials and mean time taken to 
set and check them. 

Although there is a considerable amount of 
literature upon the reading of instruments, 
there is very little upon the operation of
setting information into instruments. Actually
the only study located on this latter operation 
was that of Bradley (1) who tested a number 
of varieties of the open-window type of dial 
(moving scale). Chapanis (2, p. 137) and 
Sleight (4, p. 171) refer to unpublished stud- 
ies on counter dials, adding that such counter 
dials are not best for setting information into 
instruments. The present authors have not 
been able to obtain these studies. 

1 This study was developed for the Sandia
Corporation by the Psychology Department of The
University of New Mexico in accordance with Sandia
P. O. 51-0315. The data of this study have been
reported in detail in Research Report SC-3659 (TR)
Engineering Research, published by Sandia
Corporation.


Method ²


Dials tested and equipment used. The three types 
of dials investigated were all general-purpose, multi- 
ple-turn dials on which a three-digit number could 
be set precisely. The knobs of all dials turn 10
complete revolutions to cover the range from 000 to 999.
There are no stops at the ends of the range, and
there is some irregularity among the dials as to the
action outside the range, but this was not considered
important for the results.

Type I dial (Fig. 1) is a standard commercial dial
which has two concentric scales that move past a
pointer. The pointer, which is at the 12 o’clock
position, indicates the reading on the scales. The outer
scale, which moves directly with the knob, has 100
graduation marks, every tenth of which is marked
by a number, from 0 to 100. The last two digits
(right digits in a setting) are read from this scale.
The inner scale moves at about one tenth the speed
of the outer scale and carries 10 numbers, from 0 to
10, without graduation marks. The first digit (left
hand digit) is read from this scale.

Fig. 1. Type I dial. This dial reads 834.
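
The way a Type I setting divides between the two scales can be sketched as below. The sketch assumes that each full turn of the knob advances the setting by 100 units, counting from 000, so that the inner scale supplies the first digit and the outer scale the last two. The function name is illustrative and is not taken from the report.

```python
def type_i_reading(setting):
    """Split a setting (0-999) into the parts read from each scale."""
    if not 0 <= setting <= 999:
        raise ValueError("setting must lie between 000 and 999")
    # Inner scale gives the first digit, outer scale the last two digits.
    inner_digit, outer_digits = divmod(setting, 100)
    knob_turns = setting / 100  # assumes 100 units per full turn from 000
    return inner_digit, outer_digits, knob_turns

inner, outer, turns = type_i_reading(834)
print(f"{inner}{outer:02d} after {turns:.2f} turns from 000")
```

For the setting shown in Fig. 1, the inner scale reads 8 and the outer scale 34.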

Type II dial (Fig. 2) is strictly an experimental
dial, built upon the mechanism of a commercial dial
of a different type from Type I dial. In this dial,
the knob controls an inner concentric scale directly,
while the outer scale moves one-fifteenth of a
revolution when the inner scale moves past a critical
point (a geneva movement). The dial is covered
and an open window put at the 9 o’clock position.
The numbers are placed upon the moving scales in
such a way that the first two digits of a setting are
read directly, left to right. The last digit is read by
noting the position of a mark on the inner moving
scale against a fixed scale next to it. Complex as it
may seem to describe, the operation of the dial
proved to be quite simple.

Fig. 2. Type II dial. This dial reads 834.


2 For a more detailed description of equipment and
procedure, see (5).

Type III dial (Fig. 3) is a commercial counter 
dial, with all three digits of the setting showing as 
numbers in the center of the dial, as on the mileage 
indicator of a car. The control knob is on the out- 
side or periphery of the dial mechanism. The ex- 
perimenters dubbed it the “barrel” dial, which is 
quite descriptive in that the whole outside barrel 
was turned and the numbers appeared on the sta- 
tionary “head” of the barrel. Another type of 
counter dial was also tested in the experiment with 
results very similar to results for this dial. There- 
fore the results of the two are combined in this re- 
port. 

The dials were set up on an inside wall of each of
two booths. To one side of the dials was a clock
which, through electric controls, recorded the
intervals during which a subject stood before the dials.
A camera and floodlights, with which photographic
records were made of the settings and checkings made
by subjects and the time taken to make them, were
mounted on the opposite wall. The experimenter’s
station was between the two booths, and from
there he could operate the equipment and control
the progress of the experiment.

Procedure. Subjects (Ss) were 206 male students 
at The University of New Mexico. Of these 206, 40 
worked under flashlight conditions and the data of 
42 were discarded to equalize groups, as will be dis- 
cussed shortly. Thus the results given in this report 


are based upon the data from 124 Ss. Since this 
study includes several separate experiments, the num- 
bers of Ss working on the various types of dials 
differ; these numbers are shown in Table 1. 

No S operated more than one type of dial; there- 
fore, some comparison of Ss was necessary, inde- 
pendent of their record in dial operations. To ob- 
tain such a check, information about Ss, such as 
age, scholastic grade, and dial experience was ob- 
tained. Also visual tests on the Ortho-Rater, part 
of the Minnesota Clerical Test, and a Dial Booklet 
Test were given. The Dial Booklet Test consisted of 
a series of photographs of the dials being tested, 
from which Ss read and recorded the settings shown.

After Ss had taken the preliminary tests mentioned 
above, the experimenter took them to the booths 
containing the dials, explained the dials and other 
equipment to them, and gave them practice in setting 
the dials. The Ss then, working in pairs, went 
through the following steps: 

1. Each S picked up an instruction card from a 
stack of cards arranged in proper sequence, at the 
door of the booth in which he was to work first. 
The card told him to check a number on the dial. 
(This number, in the first operation, had been set 
by the experimenter.) The S then entered the booth 
and checked the dial setting, making a correction if 
necessary, then deposited the instruction card where 
it could be photographed, and left the booth. 

2. After photographs were taken, each S picked up
the next instruction card from the stack and re- 
entered the booth. This time he set a new number 
on the dial and left the booth. 

3. After photographs were again taken, each S 
changed booths, picked up the appropriate instruc- 
tion card, and checked the setting of his partner. 

4. Finally, Ss set new numbers on the dials in the 
same booth before reversing again. 

The above cycle continued until each S had made
50 settings and 50 checkings, and a few extra in case
some within the series were for any reason lost. The
experimental period lasted from one to two hours for
a pair of Ss.

Fig. 3. Type III dial. This dial reads 834.

In addition to the errors set on the dials, addi- 
tional errors were introduced for the checker by 
making his instructions different from those of the 
setter in 10% of the numbers he was to check. In 
these cases, although the setter may have made no 
error, the checker following him would find a set- 
ting that differed from his instruction number, and 
thus an “error” to be corrected. 

Various dials were tested under several varying 
conditions. One of these conditions was the num- 
ber of dials on a panel. Panels of one, five, and ten 
dials were made up of Type I and Type II dials. 
(There were only two Type III dials available.) In 
one series of operations the booths were darkened 
and Type III dials were set and checked under flash- 
light illumination. Actually flashlight illumination 
can be quite high at the point of reading, but it is, 
of course, uneven and requires the use of one hand.

The main results to be reported here were ob- 
tained under “normal” illumination. Under these 
conditions two factors were not separated experi- 
mentally. These factors were the height of the dial 
from the floor and the level of illumination at the 
dial. Since there was overhead lighting in the booth 
(from two sources to avoid serious shadow effects) 
the illumination automatically decreased for the 
lower dials. Dials were placed 36, 42, 48, 54, and 
60 inches above the floor. The corresponding illumi- 
nation levels were 10, 12, 15, 23, and 30 footcandles, 
respectively. It was anticipated that the effects of 
both of these factors, if any, would be to increase 
the errors on the dials nearer the floor. 
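
The degree to which height and illumination were confounded can be illustrated with a short calculation. The sketch below uses only the five heights and footcandle readings listed above; the helper name `pearson_r` is ours, and the computation is an illustration, not part of the original analysis.

```python
from math import sqrt

# Dial heights (inches above the floor) and the illumination at each height.
heights_in = [36, 42, 48, 54, 60]
footcandles = [10, 12, 15, 23, 30]


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r_height_light = pearson_r(heights_in, footcandles)
print(f"r(height, illumination) = {r_height_light:.3f}")
```

With the two factors this closely tied, an effect of height could not have been told apart from an effect of illumination in these data.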


Results 


No significant differences appeared among 
the results for the factors of number of dials 
on the panel, flashlight illumination, or for 
height of dial above the floor and the accom- 
panying shift in illumination. 

Before giving results on the different dial 
types it is necessary to consider whether the 
several groups setting the different types 
showed differences on the control variables. 
Little difference was found on the following 
three: age, educational level, and near visual 
acuity. However, the group working on Type 
III dial indicated greater dial reading experi- 
ence and made fewer errors on both the por- 
tion of the Minnesota Clerical Test used and 
on the Dial Booklet Test. For this reason 
product-moment correlation coefficients were 
computed between both setting errors and 
checking errors on the one hand and the
above three variables on the other, taking the
groups on the three types of dials separately. 

The correlations between dial experience and 
setting errors were — .27, + .37 and + .04; 
and between dial experience and checking 
errors were — .21, + .42 and + .03. Be- 
cause of the lack of consistency among these 
coefficients the factor of dial experience may 
be disregarded. 

The correlations between percentage errors 
on the portion of the Minnesota Clerical Test 
and setting errors were + .15, + .29 and 
+ .15 for the subjects setting the three types 
of dials; and between these percentages and 
checking errors they were — .03, — .12 and 
+ .03. The correlations between the Dial 
Booklet Test and dial operation errors were 
higher. For setting errors they were + .43, 
+ .76 and + .18; and for checking errors
they were + .41, + .39 and .00. 

To utilize the information obtained from
these two control variables, the three groups
of subjects working on the three types of
dials were equated for scores on both the por-
tion of the Minnesota Clerical Test and on
the Dial Booklet Test. This was done by
discarding the data for 42 Ss with the extreme
high or low scores on the two tests as required
to obtain nearly equal mean scores. The ef-
fect of this operation is shown in Table 1.
The upper part of the table shows various
statistics before equating groups and the lower
part of the table shows the same statistics
after equating. Scores on the two tests are
fairly well equalized but the differences among
the setting errors on the different types of
dials remain. This is also true of checking
errors, which are not shown.


Table 1
Comparison of Subjects and Scores Before and
After Equating Groups

                                   Type I    Type II   Type III
                                   Dial      Dial      Dial

Before Equating
  Ss                               72        36        58
  % errors on Dial Booklet Test    6.1       6.4       2.6
  % errors on part of
    Minnesota Clerical Test        2.7       2.5       1.7
  % setting errors on Dials        6.06      2.88      1.55

After Equating
  Ss                               52        26        46
  % errors on Dial Booklet Test    3.0       3.1       [illegible]
  % errors on part of
    Minnesota Clerical Test        1.8       1.9       1.4
  % setting errors on Dials        4.96      2.31      1.52


Table 2
Setting Errors by Types of Dials

Dial      Ss     Settings   Errors   % Error

I         52     2,600      129      4.96
II        26     1,300       30      2.31
III       46     2,300       35      1.52

Total    124     6,200      194

The results for the setting operation after 
the three groups were equated on the basis of 
the two tests are summarized in Table 2. 
The statistics in this table show that less than 
half as many errors were set on Type II dial, 
and less than a third as many on Type III 
dial, as were set on Type I dial. A Chi 
Square test of the number of subjects setting 
0, 1, and 2, and more than 2 errors on each of 
the three dials shows a significant difference 
between Type I and Type II dials at the .01 
level of confidence. This test does not sepa- 
rate Type II and Type III dials; but Type I 
and Type III dials differ at the .001 level. 
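
The "less than half" and "less than a third" comparisons can be checked directly from the counts in Table 2. The sketch below is illustrative only; it takes 2,600 settings for Type I dial (52 Ss at 50 settings each, which is the figure the tabled 4.96% implies), and the dictionary layout and function name are ours.

```python
# Settings made and errors set on each dial type after equating (Table 2).
table2 = {
    "I": {"settings": 2600, "errors": 129},
    "II": {"settings": 1300, "errors": 30},
    "III": {"settings": 2300, "errors": 35},
}


def percent_error(errors, trials):
    """Errors expressed as a percentage of the operations attempted."""
    return 100.0 * errors / trials


pct = {dial: percent_error(row["errors"], row["settings"])
       for dial, row in table2.items()}

# Error rate on each dial relative to Type I dial.
ratio_to_type_i = {dial: pct[dial] / pct["I"] for dial in pct}

for dial in pct:
    print(f"Type {dial}: {pct[dial]:.2f}% errors, "
          f"{ratio_to_type_i[dial]:.2f} of the Type I rate")
```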

Checking errors are those errors which are
to be located and corrected by the checker
but which are missed by him or wrongly cor-
rected. A checking error thus represents a
final error when a set and check system is
used, as in these experiments. Thus the per-
centages of checking errors measure the over-
all efficiency of the dial operations. These
percentages are given in Table 3. The re-
sults in this table bring out more strongly the
increase of accuracy to be obtained on Type
II and Type III dials. Only one-third as
many errors are left on Type II dial as on
Type I dial, and slightly more than a fifth on
Type III dial. Because of the smaller num-
ber of errors a Chi Square test can be made
only on the basis of the number of Ss mak-
ing none and Ss making one or more errors.
Such a test shows that Type I dial differs
from Type II and III dials at the .001 level
of confidence, and that Type II and III do
not differ significantly.

Ss set and checked Type III dial faster
than Types I and II, and significantly so.
Table 4 gives the average time required to
operate the dials.


Table 3
Checking Errors by Types of Dials

Dial      Ss     Checkings   Errors   % Error

I         52     2,600       103      3.96
II        26     1,300        17      1.31
III       46     2,300        19       .83

Total    124     6,200       139


Table 4
Mean Time to Set and [remainder of title illegible]

Dial      Mean Setting Time (sec.)

I         12.2
II        12.3
III        9.8

[Checking-time column illegible.]


Discussion 


In this experiment, the counter-type dial 
(Type III) proved to be superior to the other 
types of dials. It was significantly better than 
Type I dial in both accuracy and speed. It 
was significantly faster to operate than Type
II dial. Although consistently more accurate
than Type II dial, Chi Square tests did not 
show the differences between them to be sig- 
nificant. It is difficult to find a test that 
takes into account all the data in the experi- 
ment. It is the writers’ opinion that Type III 
dial is superior to Type II dial, as exemplified 
in the present experiment, because of the 
small numerals on the latter.

The results are somewhat contrary to re-
sults indicated in previous studies (2, 4), 
which showed that the counter-type dial was 
not efficient for setting information into 
equipment. It is quite possible that the dif- 
ference of results may be accounted for by 
the effect of adding another digit to be set on 
a counter-type dial. For a four-digit num- 
ber, the knob would have to be turned 100 
revolutions, using the present gear ratios. 
For a five-digit number, it would have to be 
turned 1,000 revolutions. Probably this many 
turns would be prohibitive for the four-digit 
number and certainly so for the five-digit 
number. However, the design might be im- 
proved to overcome this difficulty. 
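
The revolution counts quoted above follow from simple arithmetic once a gear ratio is fixed. The sketch below assumes the counter advances 100 counts per knob revolution; that ratio is not stated in the text and is inferred only from the two figures given (100 turns for four digits, 1,000 for five). The function name is ours.

```python
def revolutions_full_range(digits, counts_per_rev=100):
    """Knob turns needed to run a counter across its full range.

    counts_per_rev=100 is an assumption inferred from the figures quoted
    in the text; the actual gear ratio is not reported.
    """
    return 10 ** digits // counts_per_rev


for digits in (3, 4, 5):
    print(f"{digits}-digit counter: "
          f"{revolutions_full_range(digits)} revolutions")
```

Under that assumption the present three-digit dial would need 10 turns to cover 0 to 999; this figure is a consequence of the assumed ratio and is not taken from the report.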

Type II dial did not have a fair test in this 
study because of its size limitation. The fixed 
scale and the numerals on it, from which the 
last digit was read, were below standard size. 
This led to a large number of errors of 0.1. 
If the outside dimensions of the dial were 
made 2.5 in. or larger instead of the present 
1.75 in., this type of dial probably would be 
set and checked with an accuracy approaching 
that of the counter dial. An added advan-
tage to Type II dial is that it is of relatively
simple construction.

Without question, Type I dial is difficult to 
read. It has characteristics which promote 
the so-called 10 errors (3), that is, the dial is 
frequently overread by 1.0 or 10.0. Experi- 
enced operators, no doubt, learn to set and 
read this dial with considerable accuracy, but 
such an operation is nevertheless a relatively 
complex psychological feat compared to op- 
erating Type III dial. 


Summary 


Three types of multiturn dials, on which a 
range of numbers from 0 to 999 can be set, 
were tested for accuracy and speed in setting 
and checking. The results of 6,200 settings 
and 6,200 checkings made by 124 college stu- 
dents are reported. A commercial counter- 
type dial was found to be significantly more 
accurate than a commercial scale-type dial. 
Speed of operation was also significantly 
faster on the counter-type dial. An experi- 
mental-type dial of a modified scale design 
was found to be almost as accurate as the 
counter dial, although slower to operate. Be- 
cause of its simplicity it should have a useful 
place in dial-setting equipment provided its 
design is improved to bring its scale and nu- 
merals up to recommended size. 


Received August 13, 1956 


References 


1. Bradley, J. V. Desirable control-display relation-
ships for moving-scale instruments. Wright-
Patterson Air Force Base, Ohio, WADC Tech.
Rep., 1954, No. 54-423.

2. Chapanis, A., Garner, W. R., & Morgan, C. T.
Applied experimental psychology. New York:
Wiley, 1949.

3. Kappauf, W. E. A discussion of scale reading
habits. Wright Air Development Center,
Dayton, Ohio, 1951. AF Tech. Rep. No. 6569.

4. Sleight, R. B. The effect of instrument dial shape
on legibility. J. appl. Psychol., 1948, 32,
170-188.

5. Weldon, R. J., & Peterson, G. M. Factors influ-
encing dial operation: three-digit multiple-turn
dials. Sandia Corporation, Albuquerque, N. M.,
1955, Res. Rep. S.C.-6359 (TR).








Journal of Applied Psychology
Vol. 41, No. 3, 1957


Validity Studies of a Proverbs Personality Test ¹


Bernard Bass 


Louisiana State University 


A preceding report (2) described the de- 
velopment of a proverbs check list. A factor 
analysis was performed on scores based on 
400 examinees’ tendencies to accept or reject 
proverbs from thirteen intermingled lists cov- 
ering thirteen needs: material comfort, sex, 
harm avoidance, achievement, affiliation, def- 
erence, autonomy, aggression, abasement, re- 
jection, nurturance, superego strength, and 
irritability. Three factors emerged: Con- 
ventional Mores (affiliation, deference, nur- 
turance, superego strength), Hostility (au- 
tonomy, rejection, irritability) and Fear of
Failure (achievement, harm avoidance). In 
homogeneous samples, the factors were orthog- 
onal. Newly keyed, thirty-item scales using 
new samples yielded corrected split-half re- 
liabilities of .73, .69 and .75 for the three 
factor scales. 
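
The report does not name the correction applied to the split-half coefficients. Assuming it is the customary Spearman-Brown step-up formula, the sketch below shows the relation between a half-test correlation and the corrected full-length reliability, and works backward from the three reported values. The function names are ours, and the half-test correlations printed are implied values, not figures from the report.

```python
def spearman_brown(r_half):
    """Step a half-test correlation up to a full-length reliability."""
    return 2 * r_half / (1 + r_half)


def half_test_r(r_full):
    """Invert the Spearman-Brown formula to recover the half-test r."""
    return r_full / (2 - r_full)


# Corrected split-half reliabilities reported for the three factor scales.
reported = [0.73, 0.69, 0.75]

implied_half = [half_test_r(r) for r in reported]
for r_full, r_half in zip(reported, implied_half):
    print(f"corrected {r_full:.2f}  <-  half-test r of about {r_half:.2f}")
```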


Occupational and Educational Differences 


Concurrent validity analyses were performed 
by examining mean differences between edu- 
cational and occupational groups on each of 
the three scales. We expected that salesmen, 
compared to other occupational and educa- 
tional groups, would be more conforming, 
less hostile, and in greater fear of failure or 
need for achievement. We expected peni- 
tentiary prisoners to be more hostile and less 
conforming than average, and psychopathic 
prisoners to be even more so. We expected 
college students to be less conforming than 
high school students. Finally, we expected 
Southern samples to be more conforming and 
relatively lower in fear of failure. 

Twenty samples were available including
282 salesmen, 78 factory supervisors, 49 pub-
lic school teachers, 147 penitentiary inmates,
34 Marine Corps enlistees, 361 college stu-
dents, 36 student nurses, and 234 high school
students. F tests indicated that the samples
varied significantly at the 1% level on all
three scales. A number of t tests were run
between selected samples using as an esti-
mate of error the pooled within-cells estimates
of population variance from all the occupa-
tional samples.

1 This study was aided by a grant from the Louisi-
ana State University Graduate Council on Research.
The author was assisted in the analyses by Ki Suk
Kim, George Palmer, Austin Flint, and Margaret
Wakefield. He wishes to thank Donald T. Camp-
bell, Cecil Gibb, Gerald McCullough, Arnold Gebel,
and Herbert Rothschild for their help in data col-
lection.

Salesmen. The seven samples of 282 sales- 
men tended to be higher than any other 
groups in conventional mores and fear of 
failure, and lowest in hostility. The mean 
differences were as follows: 

                      283          1,191
                      Salesmen     Non-salesmen    p

Conventional Mores    47.9         42.9            <.01
Hostility             31.5         37.1            <.01
Fear of Failure       39.9         36.2            <.01


Penitentiary inmates. Inmates as a whole 
tended to be lower in conventionality, higher 
in hostility, and higher in fear of failure. Ex- 
pected differences between psychopaths and 
normal prisoners did not emerge according to 
an analysis by Palmer (6) who also found 
that prisoners could not be significantly dis-
criminated from rural high school students
matched in educational level and geographi- 
cal region of residence. 

                      147          1,327
                      Inmates      Non-inmates     p

Conventional Mores    [illegible]  44.0            <.01
Hostility             [illegible]  35.7            <.01
Fear of Failure       [illegible]  36.8            <.01


Educational differences. High school stu- 
dents scored significantly higher on all three 
scales than college students. These differ- 
ences, in part at least, were probably due to 
the significantly greater tendency of high 
school students to acquiesce to any generali- 
zations about behavior (1). The means were 
as follows: 





                      520          234
                      College      High School
                      Students     Students        p

Conventional Mores    41.0         46.3            <.01
Hostility             35.4         49.9            <.01
Fear of Failure       34.4         41.4            <.01


Regional differences. Significant regional
differences in conventional mores and hos-
tility were observed also when Southerners
from various occupational and educational
levels were compared with a pool of subjects
from the Northeast, Midwest, and West
Coast. Differences expected in fear of fail-
ure did not materialize.

                      1,009         369
                      Southerners   Non-Southerners   p

Conventional Mores    43.8          43.1              <.05
Hostility             36.7          34.6              <.01
Fear of Failure       36.8          36.8              >.05

The differences were even more apparent 
when specific occupational or educational 
groups drawn from different regions were 
compared. For example, 75 Southern sales- 
men exhibited a mean of 49.8 in conventional 
mores in contrast to a mean of 47.4 for 128 
non-Southern salesmen working for the same 
company. Southern college students earned a 
significantly higher mean on this scale than 
Midwestern college students. Similar spe- 
cific differences were found in hostility, but 
not in fear of failure. 

These regional differences cannot be at- 
tributed only to differences in social acquies- 
cence, for differences occurred on only two of 
the three scales. 

Other occupational differences. Supervisors, 
teachers, Marines, and student nurses ap- 
peared to “line up” in a rational hierarchy. 
For example, nurses were significantly more 
conventional than Marines; teachers with 
greater job tenure were significantly lower in 
fear of failure than salesmen with little job 
tenure. 


Conventional Mores        Hostility                 Fear of Failure

Salesmen     (47.9)       Nurses       (39.2)       Salesmen     (39.9)
Nurses       (45.8)       Marines      (36.3)       Supervisors  (36.0)
Supervisors  (45.0)       Teachers     (35.9)       Marines      (35.7)
Teachers     (44.1)       Supervisors  (33.1)       Nurses       (34.2)
Marines      (41.7)       Salesmen     (31.5)       Teachers     (32.7)


On the other hand some differences cannot 
be explained, such as the high mean hostility 
among the nurses. 

Academic overachievement. Kim (5) found 
that none of the three scales added signifi- 
cantly to the ACE in the accuracy of dis- 
criminating among college students in scho- 
lastic success. 


Interrelations with Other Tests 


Conventional mores. We expected this 
scale to correlate with various measures of 
social interest and sociability. The scale sig- 
nificantly correlated .43 with the Guilford- 
Zimmerman Temperament Survey Sociability 
score, — .41 with the UCPOC Ethnocentrism 
scale, and .31 with the G-Z cooperativeness 
score. On the other hand, it failed to corre- 
late significantly with selected measures of 
ascendency on the Guilford-Zimmerman, con- 
sideration as measured by the Ohio State 
Leadership Studies leadership opinion ques- 
tionnaire (4), friendliness (G-Z), and phari- 
saic virtue as derived from the MMPI (3). 

Hostility. This scale correlated −.25 with
the G-Z emotional stability scale, yet only .04
with the Gordon Personal Profile measure of
the same trait and .30 with the Gordon
measure of responsibility (p < .01). A cor-
relation of .24 was obtained with an MMPI-
derived measure of hostility (3). Hostility
correlated .20 with the UCPOC F scale, but
this could have been due to the acquiescence
involved in both.

Fear of failure. This scale was unrelated
to any of the assessment techniques examined
which included the Guilford-Zimmerman, Gor-
don Personal Profile, the Kerr Empathy Test
and leaderless group discussions.

All three scales were uncorrelated with the
Wonderlic intelligence test and only Hostility
exhibited a significant correlation (.29) with
the ACE.


Table 1
Validity of Three Scales for Predicting Success as a
Salesman and Sales Supervisor

                          Conven-
                          tional                   Fear of
Sample                    Mores       Hostility    Failure

62 Northern salesmen      −.17        −.20         −.33**
66 Northern salesmen      −.11         .08         −.09
33 Southern salesmen       .10        −.14          .04
42 Southern salesmen       .10         .01          .01
34 Sales supervisors       .07         .25          .09
28 Sales supervisors      [illegible] −.17          .14

** p < .01.


Predictive Validity Studies 

The following correlations were obtained 
between the three scales and success of 53 
factory supervisors two years later as meas- 
ured by forced-choice merit ratings: Conven- 
tional Mores, — .14; Hostility, — .12; and 
Fear of Failure, — .26. For 51 df, a correla- 
tion of — .27 is significant at the 5% level. 
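
The significance threshold quoted here can be reproduced from the usual relation between a product-moment correlation and t. The sketch below is illustrative; the critical value 2.008 is taken as the two-tailed 5% point of t for 51 df from a standard table and is an assumption of the example, as are the function names.

```python
from math import sqrt


def r_to_t(r, df):
    """t statistic for a product-moment correlation, with df = n - 2."""
    return r * sqrt(df / (1.0 - r * r))


def t_to_r(t, df):
    """Correlation that corresponds to a given t value at df."""
    return t / sqrt(t * t + df)


DF = 51             # 53 factory supervisors minus 2
T_CRIT_05 = 2.008   # assumed two-tailed 5% point of t for 51 df (table value)

r_crit = t_to_r(T_CRIT_05, DF)
t_fear = r_to_t(-0.26, DF)   # Fear of Failure vs. merit ratings

print(f"critical |r| at 5% level: {r_crit:.3f}")
print(f"t for r = -.26: {t_fear:.2f}")
```

On these figures the Fear of Failure coefficient falls just short of the 5% criterion.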

Table 1 lists correlations found between the
scales and rated success as a salesman or sales 
supervisor from three to six months after test- 
ing. Sales and supervisory merit ratings in 
both studies had reliabilities ranging from .6 
to .9. Only one validity coefficient reached 
statistical significance at the 1% level. While 
salesmen differed significantly from the gen-
eral population in their three mean scale
scores, there was little or no relation between 
their scores and rated success as salesmen. 


Summary 


This report reviews correlational analysis of 
a three-factor list of proverbs from the Fa- 
mous Sayings Test. Low positive correlations 
with corresponding measures and educational 
and occupational group differences tend to 
sustain the supposition that three content 
factors—Conventional Mores, Hostility, and 
Fear of Failure—are assessed by the three 
scales. The factors, as measured, do not ap- 
pear related to academic overachievement, 
psychopathy, or success as a salesman. How- 
ever, the fear of failure scale may have some 
utility for the forecasting of success as a fac- 
tory supervisor. 


Received August 13, 1956. 


References 


1. Bass, B. M. Development and evaluation of a 
scale for measuring social acquiescence. J. ab- 
norm. soc. Psychol., 1956, 53, 296-299. 

2. Bass, B. M. Development of a structured dis- 
guised personality inventory. J. appl. Psy- 
chol., 1956, 40, 393-397. 

3. Cook, W. W., & Medley, D. M. Proposed hos- 
tility and pharisaic-virtue scales for the 
MMPI. J. appl. Psychol., 1954, 38, 414-418. 

4. Fleishman, E. The measurement of leadership 
attitudes in industry. J. appl. Psychol., 1953, 
37, 153-158. 

5. Kim, Ki Suk. The use of the ACE and the Re-
vised Famous Sayings Test in the prediction 
of academic achievement. Unpublished mas- 
ter’s thesis. Louisiana State Univer., 1955. 

6. Palmer, G. A. Discrimination of psychopaths, 
normal prisoners and non-prisoners using a 
disguised objective personality test. Unpub- 
lished master’s thesis. Louisiana State Univer., 
1956. 










Journal of Applied Psychology 
Vol. 41, No. 3, 1957 


Colored Stationery in Direct-Mail Advertising

Colors are being added to almost every
product, not only to focus attention on the 
item itself, but also to satisfy the individual 
consumer’s desire to be associated with ob- 
jects appealing to his superior tastes. This 
observation is evidenced by the “new look” in
automobiles of two-tone and even three-tone
pastels, colored detergents for the housewife,
pastel-colored typewriters for the secretary,
etc. (7).

Similarly, by appealing to the prospect’s 
sensory and emotional faculties, colors have 
contributed to the task of increasing the ef- 
fectiveness of direct-mail advertising. Real- 
izing this, the management of the Government 
Personnel Mutual Life Insurance Company, 
which deals almost exclusively with members 
of the armed services, decided that it would 
like to use more color in letters to prospec- 
tive buyers, providing, of course, color actu- 
ally produces better results for the company 
than does the conventional white stationery. 
Therefore, GPM decided to conduct two ex- 
periments testing the effectiveness of color in 
its direct-mail advertising. It was assumed 
that one color might be better than another; 
also that combinations of colors might be im- 
portant; and that where the color was placed 
within the mailing might be important. 

The following report considers the problem, 
the method of experimentation, the results, a 
discussion of the findings, and a brief sum- 
mary of the entire report. 


Problem 


The problem is best stated in terms of a 
set of questions. 

1. Can the use of color in direct-mail ad- 
vertising result in a higher response from con- 
sumers? This question is logical since color 
has been found to have psychological influ- 
ence and symbolic significance (2, 3). 

2. Is the response from a mailing related to
the combination of colors within the mailing
(2)? A color, by itself, may not stimulate re-
sponse, but when in combination with an-
other color, a favorable or unfavorable re-
sponse may result.

3. Does the interaction between combina- 
tions of colors and their position within a 
mailing affect response? It may be that cer- 
tain combinations of colors within a mailing 
are more effective when placed in one posi- 
tion rather than in another. 

4. Does the difference in seasons, as re-
flected by a three-month period, affect the
response from colors or their combinations?
Colors found to be highly desirable during
one season can be less desirable during an-
other (3).

As a result of these questions, this hypothe- 
sis is formulated: “Colors and their combina- 
tions will have a varied effect on the success 
of direct-mail advertising. Also, the response 
will be influenced by the combinations of 
colors and their placement within the mail- 
ing. In addition, the returns will be affected 
by seasonal variation.” 


Method 


The method employed is best stated by describing
the design of the experiments, the materials used,
the subjects, the data, the sampling and control de-
vices, and the procedure for the analysis of the data.


Design of the Experiment 


Colors were selected in accordance with the find- 
ings of previous experiments (3). That is, mostly 
those colors which had been found to yield high re- 
turns were considered. They were: white, blue, 
green, yellow, pink, and goldenrod. Testing the ef- 
fectiveness between simple combinations of the six 
colors was prohibited due to the expense involved in 
testing so many configurations. 

A 3 X 3 matrix design allowed for three colors to
be tested against one another in one dimension and
three against each other in the other dimension.
Distasteful combinations were eliminated by placing
the two colors involved in the same dimension.
Conveniently, colors white, blue, and green occupied
the rows, while yellow, pink, and goldenrod were
placed in the columns. Therefore, no obvious dis-
tasteful combinations, such as goldenrod and pink,
were possible; but by this arrangement, several fa-
vorable combinations were necessarily eliminated.

The next primary source of variation in the ex- 
periment was the placement of the color within the 
mailing. The mailing was designed so that color 
combinations would reciprocate between the out- 
going envelope and the return pieces. The return 
pieces included a letter and a return envelope. There- 
fore, the 3 X 3 design was expanded to a 3 X 3 X 2 
design, which allowed for varying the position of the 
configurations of colors within the mailings. 
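
The 3 X 3 X 2 layout can be enumerated in a few lines. The sketch below reads "reciprocal placement" as swapping which color goes on the outgoing envelope and which on the return pieces; that reading is our interpretation of the description, and the variable names are ours.

```python
from itertools import product

ROW_COLORS = ("white", "blue", "green")
COLUMN_COLORS = ("yellow", "pink", "goldenrod")

# Each treatment is (outgoing envelope color, return pieces color).
treatments = []
for row, col in product(ROW_COLORS, COLUMN_COLORS):
    treatments.append((row, col))  # row color on the outgoing envelope
    treatments.append((col, row))  # reciprocal placement of the same pair

print(len(treatments), "color treatments")
for outgoing, returned in treatments:
    print(f"{outgoing}-{returned}")
```

Because both colors of a pair always come from different dimensions, two colors placed in the same dimension, such as goldenrod and pink, never meet in one mailing.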

Each experiment consisted of testing mailings for 
three consecutive months. This helped reduce ex- 
perimental error and provided a measure of the sea- 
sonal effects. 


Materials 


The stationery was manufactured from twenty- 
pound-weight paper. Both the outgoing and the re- 
turn envelopes were designed from paper of these 
colors: White Elm, Blue Aspen, Green Aspen, Yel- 
low Birch O'Paque, Cherry Aspen, and Goldenrod 
Hammermill. The letterheads were cut from Hamil- 
ton Bond and named: White, Blue, Green, Canary, 
Pink, and Goldenrod. In addition, the printing of 
the letterheads was done with black ink. The di- 
mensions of the materials were: outgoing envelope, 
8 34 in.; return envelopes, 74 * 344 in.; and letter- 
heads, 104 X 74 in. 


Subjects 


The subjects were military personnel on active 
duty in the Navy and Air Force. However, more 
Navy than Air Force personnel were included in this 
group. The first experiment was conducted using 
officers as subjects while the second experiment in- 
volved enlisted men. They were selected by repre- 
sentatives of the Government Personnel Mutual Life 
Insurance Company because they were likely pros- 
pects for the purchase of the insurance plans spe- 
cifically mentioned in the letters. The subjects were 
located in states including Maryland, Virginia, Mas- 
sachusetts, Pennsylvania, Texas, Colorado, New 
Mexico, Kansas, Florida, North Carolina, Georgia, 
California, Washington, and Washington, D. C. 


Data 


The data for both experiments consisted of the 
number of returns and nonreturns from letters mailed. 

The first experiment on officers was conducted dur- 
ing July, August, and September of 1955. Letters 
mailed during July totaled 1,620, during August 
3,109, and 3,898 during September. The experiment 
involving enlisted men was conducted during Feb- 
ruary, March, and April of 1956. The number 
mailed during February totaled 1,652, during March 
1,176, and 601 during April. 

The allowable period during which letters could be
returned lasted from the date of the mailings until
the end of the next calendar month. The letters
were mailed daily within each month. The data
representing the outcome of the experiments was
separated according to the months during which the
letters were mailed.

Donald H. Bender


Sampling and Control Devices 


The random assignment of a particular treatment 
to a subject was permitted by a table of random 
numbers taken from Snedecor (6). 

Since all of the combinations of colors within the 
matrix design involved at least one color other than 
white, a control on color was permitted by adding a 
combination, white-white. Therefore, white-white 
represented no color and formed the basis of meas- 
uring the effects of all color and individual combina- 
tions of colors by themselves. 
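
Random assignment of the 18 color treatments plus the white-white control might be sketched as follows. A seeded pseudorandom generator stands in here for the printed table of random numbers; the seed, the subject identifiers, and the function names are hypothetical. The sketch makes simple independent draws, whereas the near-equal cell sizes reported in the tables suggest the original scheme was balanced, which is not attempted here.

```python
import random
from itertools import product

ROW_COLORS = ("white", "blue", "green")
COLUMN_COLORS = ("yellow", "pink", "goldenrod")


def build_treatments():
    """18 color treatments plus the white-white control."""
    pairs = []
    for row, col in product(ROW_COLORS, COLUMN_COLORS):
        pairs.append((row, col))
        pairs.append((col, row))
    pairs.append(("white", "white"))  # control: no color
    return pairs


def assign(subject_ids, treatments, seed=1955):
    """Draw one treatment at random for every subject (seed is hypothetical)."""
    rng = random.Random(seed)
    return {sid: rng.choice(treatments) for sid in subject_ids}


all_treatments = build_treatments()
subjects = [f"prospect-{i:04d}" for i in range(1, 21)]  # hypothetical ids
assignment = assign(subjects, all_treatments)
for sid, (outgoing, returned) in assignment.items():
    print(sid, f"{outgoing}-{returned}")
```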

While the copy for the officer experiment was 
typed automatically permitting a first copy letter, 
the written matter for the enlisted experiment was 
printed. The copy was typed or printed in black 
using elite Gothic double-case style. 


Techniques of Measurement 


The techniques of measurement used for both ex- 
periments were identical. The objective measure- 
ment of the relationship between the variation of 
response and the variation of color configurations 
was computed by Chi Square (χ²) and Analysis of
Variance significance tests (4, 5, 6). 

The Chi Square test was used to measure the dif- 
ference in response between white-white, all color, 
and specific color combinations placed a particular 
way within a mailing. The Analysis of Variance 
was used to measure the significance of interactions 
involving the primary source of variation, that is, 
returns versus nonreturns. 

The results of the experiments are indicated by
the significance tests.


Results 


In this section, tables are included which
portray the results of both experiments.¹

It is illustrated in Table 1 that, although
colored mailings as a group yielded higher re-
turns than white-white, the difference is not
great enough to be significant. Typically,
chi square is not used in contrasting a se-
lected statistic with a control. However, in
Table 2 and Table 3 chi square was used as
a practical test for all combinations which
might have produced a yield significantly
higher or lower than white-white.

1 In the tables, specific combinations of colors are
indicated by stating the color of the outgoing en-
velope followed by a hyphen and then the color of
the return pieces.


Table 1
A Comparison Between All-Colored
Mailings and White-White

Experiment      Variables      No. Mailed    % Returns

Officers        All Color      8,177         9.2
                White-White      450         8.9
Enlisted Men    All Color      3,057         7.8
                White-White      172         6.4


Table 2 shows that the best producing
combinations (blue-pink for the officers and
white-pink for the enlisted men) were found
to be insignificantly different from their con-
trol white-white. Therefore, no combination
resulted in a yield greater than white-white
that could not have occurred by chance.

Table 3 indicates that chance could have
accounted for combinations producing less
than white-white.

An Analysis of Variance was computed for 
each experiment to measure the significance 
of returns versus nonreturns by each of four 
variables: white, blue, and green; yellow, 
pink, and goldenrod; reciprocal placement of 
color combinations; and the three months 
during which the letters were mailed. In 
evaluating the difference between returns 
versus nonreturns, the effects of the interac- 
tions of these four independent variables were 
also measured.²

The only relationship found to be significant
was the interaction between returns
versus nonreturns and the three months the
letters were mailed. This finding was consistent
in both experiments. The monthly
variation during the officer experiment was:
July, 9.4%; August, 9.89%; and September,
8.6%. The enlisted experiment produced
9.0% from February, 5.7% from March, and
8.4% from April.

Table 2
A Comparison Between the Best Yielding
Combinations and White-White

Experiment     Variables      No. Mailed   % Return
Officers       Blue-Pink           451       11.5
               White-White         450        8.9
Enlisted Men   White-Pink          172       11.6
               White-White         172        6.4

Table 3
The Combinations Yielding the Poorest Return
Compared with White-White

Experiment     Variables      No. Mailed   % Return
Officers       Pink-White          400        6.7
               White-White         450        8.9
Enlisted Men   Yellow-Green        172        5.2
               White-White         172        6.4

² The table illustrating the results of each Analysis
of Variance has been deposited with the American
Documentation Institute. Order Document No. 5127
from ADI Auxiliary Publications Project, Photoduplication
Service, Library of Congress, Washington
25, D. C., remitting in advance $1.25 for microfilm
or $1.25 for photocopies. Make checks payable
to Chief, Photoduplication Service, Library of
Congress.

The results from objective measurement
clearly indicate that the hypothesis “Colors
and their combinations will have a varied effect
on the success of direct mail advertising,
and also that the response will be influenced
by the combinations of colors and their placement
within the mailing” was not supported
with regard to the particular needs of
the Government Personnel Mutual Life Insurance
Company. The hypothesis “The returns
will be affected by seasonal variation”
was substantiated.


Discussion 


After the results of the first experiment on 
officers were observed, it was thought that 
possibly their educational and maturity level 
was inhibiting the effects of the color vari- 
able (1). This led to the experiment using 
enlisted men as subjects. 

It is interesting that the results of these ex- 
periments conflict so sharply with the results 
measured from other sources. Certainly, in 
other fields of business the trend is toward 
the use of more colorful products and dis- 
plays. The findings observed here do not 
imply that all color in direct mail is wasted. 
Instead, one conclusion might be that the use 
of color in direct mail and other forms of 
advertising should be evaluated more objec- 
tively. 





Summary 


In summary, the Government Personnel 
Mutual Life Insurance Company set out to 
see if it could improve its response to direct- 
mail advertising through the use of colored 
stationery, combinations of colored pieces, 
and placement of colors in the mailing. 

In an effort to evaluate the problem, two 
experiments were conducted, one involving of- 
ficer personnel and the other using enlisted 
men. The results indicate that the response 
to direct mail in this company cannot be im- 
proved through colored stationery. 

Since the results conflict with most of the 
common beliefs about this variable, the value 
of using color in other direct-mail advertising 
should be reviewed more objectively. 


Received July 26, 1956. 
The design of this study was such as to
permit a comparative analysis of the learning
of a stimulus-word list of 25 nouns presented
under three different conditions, here designated
as a, b, and c. These conditions were:
a, words alone; b, words presented simultaneously
with their uncolored pictorial representations;
c, words presented simultaneously
with their colored pictorial representations.
In outlining the theoretical rationale of this
study, we shall refer to both the words and
their corresponding pictures as signs of the
objects they denote. In these terms, Condition
a involved the minimal number of signs.
Next came Condition b with the addition of
black-and-white pictorial representations of
the form and structure of the objects denoted
by the words. Condition c involved
the largest number of signs by virtue of the
addition of color to the uncolored outlines of
form and structure. This arrangement provided
a test of the following hypothesis: With
the number of presentations for the learning
of stimulus words held constant, the number
of these words recalled by Ss should vary
positively within limits with the number of
simultaneously presented additional signs. In
other words, from the standpoint of potency
to induce recall, the experimental conditions
should vary so that a < b < c.

1 This paper is based on Technical Report No. 18
under Contract Nonr-631 (00) between the Office of
Naval Research and the University of Connecticut.
Reproduction in whole or in part is permitted for
any purpose of the United States Government.

Several types of derivation of our experimental
hypothesis are available. What appears
to us to be the simplest, and the one we
shall outline here, is based on an application
of the model of classical conditioning. First,
we assume our Ss entered the experiment with
more or less well-established habits of writing
the stimulus words in response to their corresponding
signs. In order to avoid what we
regard as the cumbersomeness of orthodox 
terminology, we shall use the convention of 
treating these already established motor re- 
sponses of writing the words as unconditioned 
responses to the signs which functioned as un- 
conditioned stimuli. In the experiment the 
Ss were told to write as many of the stimulus 
words as they could recall on prepared data 
sheets. As a consequence of these instruc- 
tions they learned to connect the data sheets, 
which were in effect conditioned stimuli, with 
the unconditioned responses of writing the 
words. Within the framework of these as- 
sumptions, we may consider the predicted 
consequences of the compounding of signs. 
We base our experimental hypothesis on the 
results of a study of Weber and Wendt (3) 
on the conditioning of eyelid closure to a sud- 
den increase in a fixated light. For effective- 
ness in the development of the conditioned re- 
sponse, the compound unconditioned stimulus 
of a loud sound and a puff of air was better 
than either the loud sound or the puff of air 
alone. As we have already indicated, our 
three experimental conditions were such as to 
involve varying degrees of compounding of 
signs. The signs, moreover, functioned as un- 
conditioned stimuli. We may speculate that 
at some point an increase in the multiplicity 
of a compound unconditioned stimulus should 
result in diminishing returns for the estab- 
lishment of conditioned responses. To the 
degree that this speculation is valid, it is 
necessary to assume a limit to the generality 
of the application of our experimental hy- 
pothesis. 


Method

Subjects. The Ss were 165 University of Connecticut
students enrolled in six laboratory sections
in introductory psychology ranging in size from 20
to 39 members. Three of the six sections, here
designated as the “Campus Ss,” comprised students
in residence. The remaining three sections, here
designated as the “Branch Ss,” comprised students
enrolled in one of the University Branches.

Stimulus items and apparatus. The 25 stimulus
words used in this study were nouns selected so as
to be classifiable either conceptually according to
generic categories of meaning or perceptually according
to the characteristic color of the objects they
denoted. These words appear in Table 1. For Condition
a, the words were typed on Radio-Mats from
which 35-mm. slides were made. For the pictures
used in Conditions b and c, the services of an artist
were employed who first made realistic black-and-white
sketches of the objects denoted by the words.
These sketches were then filled in with their appropriate
colors according to the categories shown in
Table 1. The titles, i.e., the stimulus words, were
printed below the pictures. A professional photographer
made four photographs of each picture on
35-mm. film, two in black and white and two in
Ektachrome. Slides made from these photographs
provided the stimulus materials for Conditions b and
c. A standard Selectroslide projector, set so as to
give 24 sec. of exposure per item, was used for projecting
the stimulus materials on a screen. Mimeographed
data sheets were prepared for use by Ss in
writing the words they were able to recall.

Table 1
Stimulus Words

Meaning                       Color Categories
Categories    White    Yellow     Green     Red        Purple
Birds         gull     canary     parrot    cardinal   martin
Fruits        melon    banana     lime      apple      grape
Flowers       lily     daffodil   fern      rose       aster
Nature        snow     desert     meadow    sun        mountain
Vegetables    onion    squash     pepper    radish     eggplant

Procedure. Two Es conducted the experiment.
One used the three groups of Campus Ss, one group
for each of the experimental conditions, and the
other similarly used the three groups of Branch Ss.
Each E proceeded as follows: He first distributed
the data sheets to Ss and then read the instructions.
The Ss were told that a list of words (with their
corresponding pictures for Conditions b and c) would
be projected on the screen; after two complete presentations
of the words, and at a signal from E, they
were to write on the data sheets as many of the
words as they could recall, and write the words in
the order of their occurrence in memory. The items
were then presented in two random orders which
were prepared from a table of random numbers.
The signal to start writing was given 3 sec. after the
completed second presentation of the items. Ten
minutes were allowed for recall.

2 These words were used in a study of associative
clustering. The present report deals with one of the
several analyses of the data.


Results 


In scoring the data of the individual Ss we 
recorded two measures. These were the num- 
ber of stimulus words correctly recalled and 
the number of errors. The latter included 
words not presented in the stimulus list, 
stimulus words listed more than once, and in- 
decipherable words. We first analyzed sepa- 
rately the data of the 91 Campus Ss and the 
74 Branch Ss as they related to the three ex- 
perimental conditions. The probability from 
Bartlett’s test of homogeneity of variance, as 
applied to the two sets of data for the three 
experimental conditions, was between .50 and
.30. We believed this justified a pooling of
the data, and we shall here report the pooled 
results. 

Table 2 shows the general nature of the
data as they relate to the number of stimulus
words recalled in the three experimental conditions.
The differences between the means
are significant at the following levels of confidence:
a and b, .005; a and c, .0005; b
and c, .05. The results of the analysis of
variance appear in Table 3. We employed
Snedecor’s (1, pp. 289-290) design for disproportionate
subclass numbers in an R × 2
table. This design was used because the
groups were intact and had unequal Ns. It
gives an independent measure of interaction
between the variables and a correction for
disproportionality. The analysis showed that
the variance between the experimental conditions
was significant (p < .001). The interaction
was negligible and the variance between
populations was not significant. The
correction for disproportionality was actually
unnecessary since the differences were sufficiently
large to be significant without this
correction.

Table 2
Number of Stimulus Words Recalled for the
Three Experimental Conditions

Condition*   No. Ss   Mean    SD     SE_M
a              67     15.55   3.29   .40
b              58     17.97   2.52   .33
c              40     19.15   2.60   .42

* a, words alone; b, words plus uncolored pictures; c, words
plus colored pictures.

Table 3
Analysis of Variance for Number of Stimulus
Words Recalled

Source of                                   Mean
Variation             Variance      df     Square     F
Total                 1730.73      164
Between Conditions     376.58*       2     188.29    22.31**
Between Populations     21.50*       1      21.50     2.55***
Interaction           [illegible]    2        .16
Within                1341.80      159       8.44

* Corrected variance.
** P < .01.
*** P > .05.
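As a rough check on the figures reported with Tables 2 and 3, the mean differences can be turned into t ratios from the printed summary statistics, and the F ratio for conditions can be recomputed from the printed mean squares. This is only a sketch: the article does not say which formula was used for the pairwise comparisons, so the unpooled ratio below is an assumption, and the function name is hypothetical.

```python
import math

# Summary statistics as printed in Table 2: (number of Ss, mean, SD).
conditions = {
    "a": (67, 15.55, 3.29),  # words alone
    "b": (58, 17.97, 2.52),  # words plus uncolored pictures
    "c": (40, 19.15, 2.60),  # words plus colored pictures
}


def t_from_summary(first, second):
    """Unpooled t ratio for the difference between two independent means.

    ASSUMPTION: the original analysis may have used a pooled error term.
    """
    n1, m1, sd1 = first
    n2, m2, sd2 = second
    se_diff = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (m2 - m1) / se_diff


t_ab = t_from_summary(conditions["a"], conditions["b"])
t_ac = t_from_summary(conditions["a"], conditions["c"])
t_bc = t_from_summary(conditions["b"], conditions["c"])

# F ratio for conditions: mean square between conditions / mean square within.
f_conditions = 188.29 / 8.44

print(f"t(a,b) = {t_ab:.2f}, t(a,c) = {t_ac:.2f}, t(b,c) = {t_bc:.2f}")
print(f"F(conditions) = {f_conditions:.2f}")
```

The ordering of the three t ratios follows the ordering of the reported confidence levels, with the a versus c comparison the largest and the b versus c comparison the smallest.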

The analysis of the errors indicated that 
their incidence was relatively low and there 
were no reliable differences in their frequen- 
cies of occurrence for the three experimental 
conditions. The means of the error scores 
were as follows: 0.61 for the 67 Ss in Condition
a with the words alone; 0.62 for the 58
Ss in Condition b with the words plus their
uncolored pictures; 0.30 for the 40 Ss in Condition
c with the words plus colored pictures.
A trend in the direction of the hypothesis, 
though not a significant one, was implied in 
the percentages of total responses which were 
errors. They were as follows: Condition a, 
3.77; Condition b, 3.34; Condition c, 1.54. 
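The error percentages can be approximately reproduced from the reported means, on the assumption that "total responses" means words correctly recalled plus errors. The short sketch below works from the rounded means rather than the raw data, so small discrepancies are to be expected; the names are hypothetical.

```python
# Mean words recalled (Table 2) and mean error scores, per condition.
mean_recalled = {"a": 15.55, "b": 17.97, "c": 19.15}
mean_errors = {"a": 0.61, "b": 0.62, "c": 0.30}


def percent_errors(errors, correct):
    """Errors as a percentage of all responses.

    ASSUMPTION: total responses = correct recalls + errors.
    """
    return 100 * errors / (errors + correct)


pct = {
    cond: percent_errors(mean_errors[cond], mean_recalled[cond])
    for cond in mean_errors
}
for cond, value in pct.items():
    print(f"Condition {cond}: {value:.2f}% of responses were errors")
```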


Discussion 


According to our interpretation, this study 
confirmed the experimental hypothesis which 
predicted an increase in the recall of stimulus
words occurring as a consequence of an in- 
crease in the number of simultaneously pre- 
sented additional signs of the objects denoted 
by these words. Our discussion will be lim- 
ited to two considerations: first, the relation- 
ship between our findings and those of a re- 
lated experiment; second, the nature of the 
needs for further research on the compound- 
ing of signs. 

VanderMeer (2) compared the effectiveness 
for learning of the use of color films and the 
same films in black and white. These moving 
pictures were in both cases accompanied by 
running commentaries. He concluded that 
the factor of color resulted in no significant 
gain for the immediate acquisition of learn- 
ing. On the other hand, the results suggested 
that the use of color resulted in some degree 
of reduction in the rate of forgetting. His 
general conclusion was that the use of films 
which apparently “call for color” does not ap- 
pear justifiable for promoting more effective 
learning. In accounting for his somewhat 
negative results, VanderMeer speculated that 
the impact of color may have diverted atten- 
tion from other important cues for learning. 
Furthermore, as he points out, his color films 
may not have presented his Ss with the range 
of colors typically associated with the objects 
presented. Within the framework of the the- 
ory considered in the present study, we are 
inclined to speculate that VanderMeer’s find- 
ings may be explained along the lines we have 
indicated. The multiplicity of the cues pre- 
sented in his color films may have exceeded a 
critical level for effectiveness of compounding. 
Furthermore, to the degree that the colors in 
the films differed from those typically asso- 
ciated with their corresponding objects, there 
would be an effect of interference. Current
concern over the use of visual aids in learning
situations indicates the need for further
research in this area, and our analysis suggests
the nature of some of the factors requiring
control. We would further submit, however,
that the basic theoretical problem in
undertakings of this type can well be phrased
as that of the compounding of signs. In defining
the nature of the types of basic experimentation
for programmatic purposes, we
would propose that compounds of signs may
be purely verbal, purely nonverbal, or mixed 
as was the case in the present experiment. 
On the other hand, the associated responses 
may not only be those of the motor-habit 
type, but may include measurable autonomic 
responses as well. 


Summary 


The experiment permitted a comparative 
test of the immediate retention of 25 stimulus 
words, all of which were nouns, when they 
were presented twice for learning under three 
conditions as follows: a, words alone; b,
words presented simultaneously with their
uncolored pictures; c, words presented
simultaneously with their colored pictures. These
three conditions were interpreted as representing
varying degrees of the compounding of
signs of the objects denoted by the words.
The results were regarded as giving support
to the following experimental hypothesis:
With the number of presentations for learn- 
ing of stimulus words held constant, the num- 
ber of these words recalled by Ss should 
vary positively within limits with the num- 
ber of simultaneously presented additional 
signs. The theoretical rationale of the study 
was based on the model of classical condi- 
tioning. 


Received August 6, 1956. 
Some Correlates of Attitudes on the 1956 Steel Strike

The ensuing research was conducted during
the nationwide steel strike of July, 1956, in 
the Illinois-Indiana Calumet steel district. 
The purpose of the investigation was to ascertain
certain strike attitudes and their relation- 
ships with certain objective data about the 
strikers themselves. 

A questionnaire was constructed measuring
the “agree-disagree” reactions toward 23
attitudinal statements and one final, general 
strike attitude statement (“Do you feel that 
a strike at this time is worth the gains that 
may result?”), plus seven items of back- 
ground information about the striker respond- 
ent. This form appeared brief, being all on 
one page. The 23 attitudinal items were di- 
vided into three blocs consisting, respectively, 
of nine pro-strike, nine anti-strike, and five 
neutral statements. 


Procedure 


The data were obtained by personal interview in 
the East Chicago, Illinois, and Indiana Harbor, 
Indiana, steel community area. The questionnaire 
was presented as a nonpartisan worker opinion re- 
search project of the Illinois Institute of Technology 
with all replies to be held in strict anonymity and a 
summative report of the results to be submitted to 
both parties to the Strike. Data were obtained from 
three principal types of respondents: 70 strike pick- 
ets; 13 strikers lounging in bars and taverns; 39 
strikers in their homes. These 122 cases appear rela- 
tively representative except for inadequate sampling 
of workers in their homes. It is probable that while 
the full continuum of striker attitudes is present 
here, the sample contains a disproportionately large 
number of the more militant strikers. But since the 
purpose of this study is primarily the determination 
of relationships rather than “proportions of opinion,” 
the results should be but mildly influenced by the 
small representation of home-call cases. 

A total strike attitude score was computed by
summating all pro-strike responses (among the first
18 items) and by subtracting from this total all
anti-strike responses (among the same first 18 items),
making a possible attitude score range of −18 to
+18. This score, then, for each of the 122 strikers was
correlated with the specific variables shown in Table 1.
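The scoring rule just described can be sketched as follows. The item numbering (items 1 to 9 pro-strike, 10 to 18 anti-strike) and the function name are hypothetical, and the sketch assumes that disagreeing with a pro-strike statement counts as an anti-strike response and vice versa, which is what the stated range of −18 to +18 seems to require.

```python
# ASSUMPTION: hypothetical numbering of the 18 scored items.
PRO_ITEMS = range(1, 10)    # items 1-9: pro-strike statements
ANTI_ITEMS = range(10, 19)  # items 10-18: anti-strike statements


def strike_attitude_score(answers):
    """Return pro-strike responses minus anti-strike responses.

    `answers` maps item number (1-18) to True for "agree" and
    False for "disagree".
    """
    pro = anti = 0
    for item in PRO_ITEMS:
        if answers[item]:
            pro += 1   # agreeing with a pro-strike statement
        else:
            anti += 1  # ASSUMPTION: disagreement counts as anti-strike
    for item in ANTI_ITEMS:
        if answers[item]:
            anti += 1  # agreeing with an anti-strike statement
        else:
            pro += 1   # ASSUMPTION: disagreement counts as pro-strike
    return pro - anti


# A respondent who agrees with every pro item and rejects every anti item.
militant = {item: item in PRO_ITEMS for item in range(1, 19)}
print(strike_attitude_score(militant))
```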


Results and Discussion 


Inspection of the Table 1 biserial coefficients
reveals that the individuals most favorable
to the strike tended to be married, to
have children, to have had past strike experiences,
and to have had less than two years of
high school. Variables not significantly related
to strike attitude were opinion on how
long the strike would last, buying appliances
or automobile on installment, and age of the
striker. The installment-buying result contradicts
U. S. News and World Report (1)
magazine's stated assumption that installment
buyers were not favorable toward the strike.

Table 1
Biserial Correlations Between Total Strike Attitude
and Each of Eight Striker Characteristics
in the 1956 Steel Strike
(N = 122)

Question                                                   Biserial r
Are you married?                                               .31
Do you have children?                                          .36
Have you ever been on strike before?                           .34
Do you think this strike will last two months?
Are you buying appliances or automobile on installment?
Are you less than 36 years of age?
Did you have more than 2 years of high school?
Do you feel that a strike at this time is worth
  the gain that may result?
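For readers unfamiliar with the statistic, a biserial coefficient between a yes-no item and the attitude score can be computed along the following lines. This is a minimal sketch using the textbook formula; the scores and answers are invented for illustration and are not data from the study.

```python
import math
from statistics import NormalDist


def biserial_r(scores, answered_yes):
    """Biserial correlation between a continuous score and a yes/no item.

    Uses r = ((M_yes - M_no) / SD_total) * (p * q / y), where y is the
    ordinate of the unit normal curve at the point dividing it into
    areas p and q. Both answers must occur at least once.
    """
    n = len(scores)
    yes = [s for s, flag in zip(scores, answered_yes) if flag]
    no = [s for s, flag in zip(scores, answered_yes) if not flag]
    p = len(yes) / n
    q = 1 - p
    mean_all = sum(scores) / n
    sd_all = math.sqrt(sum((s - mean_all) ** 2 for s in scores) / n)
    ordinate = NormalDist().pdf(NormalDist().inv_cdf(p))
    mean_diff = sum(yes) / len(yes) - sum(no) / len(no)
    return (mean_diff / sd_all) * (p * q / ordinate)


# Hypothetical attitude scores and answers to one yes/no question.
example_scores = [10, 4, -2, 8, 0, 6, -4, 2, -6, 4]
example_yes = [True, True, True, True, True, False, False, False, False, False]
r_example = biserial_r(example_scores, example_yes)
print(f"biserial r = {r_example:.2f}")
```

A positive value indicates that respondents answering "yes" tended to have higher attitude scores.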
1. Labor's four years of peace. U
Conflicting Principles in Man-Machine System Design ¹

and John T. Lanzetta


Fels Group Dynamics Center, University of Delaware 


An almost universal characteristic of com- 
plex man-machine systems, such as ships, 
multi-engine planes and certain manufactur- 
ing processes, is the physical dispersion of 
critical information sources. In such systems 
it is virtually impossible for a single indi- 
vidual to monitor more than a fraction of all 
the relevant input displays, even with heavy 
instrumentation. Since any adjustment or 
control action may require information as to 
several system variables, such systems are 
often manned by teams in which each person 
serves both as an operator, or control agent, 
and as an observer, or information source, for 
himself and other control agents. 

In previous experiments (3, 4, 6) an at- 
tempt has been made to determine organiza- 
tional principles governing the effectiveness of 
this type of system. In general, it has been 
shown that, other things being equal, the 
more “autonomy” attaching to display-con- 
trol actions the better is the performance ob- 
tained. That is, the optimal arrangement of 
displays and controls is one in which each 
person who needs certain classes of informa- 
tion for making control actions is himself the 
primary source of that information. In ad- 
dition, it was determined that if information 
relating to a control must be relayed, it is best 
if it is relayed from a single source rather than 
from several sources. 

While these principles seem plausible enough,
and could perhaps be applied directly to system
design, it has been pointed out (5) that
there are dangers in any single-principle approach
to this problem. Examples may readily
be constructed in which a single-minded adherence
to the “autonomy” principle would
lead to a violation of other principles which
have fully as much a priori plausibility.

¹ This study was performed in support of Project
7713 under the Air Force Personnel and Training Research
Center, Lackland Air Force Base, San Antonio,
Texas. Permission is granted for reproduction,
translation, publication, use and disposal in whole
and in part by or for the United States Government.

One such alternative principle is that of
“load balancing,” the notion that the total
work of the team should be distributed as 
evenly as possible. This principle may be 
defended on the basis of both human engi- 
neering and sociopsychological considerations. 
On the other hand, if certain physical con- 
straints make it essential for one or more per- 
sons to have responsibility for a dispropor- 
tionate number of control actions, the assign- 
ment of responsibilities for observations must 
sacrifice either autonomy or load balancing. 
If the “overloaded” persons are assigned all 
observations which relate to their own con- 
trol actions, this leads to even grosser over- 
loading; if observations relating to their con- 
trols are assigned to less heavily burdened 
team members, autonomy is violated. 

The present experiment was designed to in- 
vestigate this antinomy in system design. An 
additional feature of the experiment is the in- 
troduction of several levels of over-all “load” 
as defined by the rate of change of system 
variables. It was considered that inclusion of 
the latter parameter might shed some light on 
the ranges over which the foregoing principles 
are most effective. 


Procedure 
Task 


The task used in this study was essentially the 
same as that described in several previous reports 
(4, 5, 6). The physical layout consisted of three 
components: the displays, the controls, and the com- 
munication system. 

The display component consisted of 6 AC voltmeters
on each of which an “instrument” dial was
simulated. The six distinct simulated instruments
were: Rate of climb (RC), Air speed (AS), Fuel
pressure (FP), Manifold pressure (MP), Air temperature
(AT), and Generator voltage (GV). The
dial readings were programmed through flexible leads
from a central console. Each instrument could be
set at one of three positions.

Operating procedures: Power settings

Manifold    Air            Turn power
pressure    temperature    setting
   30          10             on
               20             on
               30             off
   40          10             on
               20             on
               30             off
   50          10             off
               20             off
               30             on

Fig. 1. A representative “operating procedures” card.

The control component consisted of 6 two-position
switches which were also labeled to simulate aircraft
controls. The names attached to these switches were:
Landing-gear lever (LG), Power setting (PS), Selector
knob (SK), Control switch (CS), Steering
mechanism (SM), and Reset lever (RL). Associated
with each switch was an “operating procedure” card
similar to the representative card shown as Fig. 1.
These cards indicated what correct switch settings
should be, depending on certain instrument readings.
Correct settings for each switch were determined by
the readings on a particular pair of instruments.
The pattern of display-control relationships is shown
by “X” entries in Fig. 2.

The subjects sat in small booths which were enclosed
on three sides so that direct communication
was not possible; hence a communication system
was required. Each booth contained a control box
with a set of three levers and signal lights. Each
lever and signal light corresponded to one of the
three booths. Thus, to call Booth 2, Booth 1 depressed
his No. 2 lever; this lighted the No. 2 lights
on the control boxes in all three booths. To respond,
the person in Booth 2 depressed his own No. 2 lever.
Each of the 9 levers was wired into a 20-pen Esterline-Angus
recorder so that a complete record of
calls and acknowledgments could be obtained. Actual
communication took place over 6-volt interphone circuits
with conventional headsets and microphones.

[Fig. 2. A schematic diagram indicating display and control
assignments and required interactions for three experimental
structures. Column headings: Booth 1 (RL); Booth 2 (SK, CS);
Booth 3 (LG, SM, PS). Row headings: the instruments placed in
each booth under Structures I, II, and III.]


Subjects and General Administration


Subjects for this experiment were basic trainees at
Lackland Air Force Base. Six subjects reported for
testing each day and were divided fortuitously into
two teams of three subjects each for morning and
afternoon testing. They were brought into the experimental
room, seated in their booths, and given
an informal orientation talk on the general nature of
the task and the use of the interphone equipment. 
Following this, they were given a brief practice ses- 
sion to insure that they understood how to interpret 
the instrument readings and the operating instruction 
cards. They then performed the task under one of 
the experimental conditions described below. 
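Interpreting an operating-procedure card amounts to a table lookup on a pair of instrument readings. The sketch below encodes the representative Power setting card of Fig. 1, which depends on Manifold pressure and Air temperature; the dictionary and function names are hypothetical, and only this one card is given in the report.

```python
# Power setting card from Fig. 1:
# (manifold pressure, air temperature) -> required switch position.
POWER_SETTING_CARD = {
    (30, 10): "on", (30, 20): "on", (30, 30): "off",
    (40, 10): "on", (40, 20): "on", (40, 30): "off",
    (50, 10): "off", (50, 20): "off", (50, 30): "on",
}


def correct_setting(manifold_pressure, air_temperature):
    """Look up the prescribed Power setting for a pair of readings."""
    return POWER_SETTING_CARD[(manifold_pressure, air_temperature)]


def is_error(actual_setting, manifold_pressure, air_temperature):
    """True when the switch is not at the prescribed setting."""
    return actual_setting != correct_setting(manifold_pressure, air_temperature)


print(correct_setting(40, 30))
print(is_error("on", 50, 10))
```

Because each instrument had three positions, every card of this kind has nine rows, one for each combination of the two readings.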


Experimental Conditions and Design 


The three “structure” conditions which were em- 
ployed are schematized in Fig. 2. Column headings 
show booth numbers and the control switches for 
which each booth was responsible. These are con- 
stant throughout the experiment. In other words, 
the Reset lever (RL) was always placed in Booth 1, 
the Selector knob (SK) and Control switch (CS)
were always placed in Booth 2, and the Landing- 
gear lever (LG), Steering mechanism (SM), and 
Power setting (PS) were always in Booth 3. The 
row headings indicate the placement of instruments 
under various experimental conditions. In Structure 
I, Air temperature (AT) and Air speed (AS) instruments
were placed in Booth 1. In Structure II, only
Air temperature (AT) was placed in Booth 1, and 
in Structure III, Air temperature (AT), Air speed 
(AS), and Rate of climb (RC), were placed in 
Booth 1. 

Since the functional dependence of any setting on 
a pair of instruments was the same for all three 
structures, it was necessary for group members to 
relay information on certain instruments under each 
structure. The information which had to be relayed 
and the “source” and “user” of the information are 
given by the off-diagonal entries in Fig. 2. For ex- 
ample, the “X” opposite AT and under PS in Struc- 
ture I indicates that Booth 3 had to obtain Air tem- 
perature from Booth 1 in order to adjust Power 
setting correctly. It will be noted that there are 
three such entries in Structure I, three in Structure 
II, and seven entries in Structure III. 

Looked at from the standpoint of individual op- 
erator responsibility, all structures are of course 
equivalent in assigning Booth 3 three switches; 
Booth 2, 2 switches, and Booth 1, only 1 switch. 
Under Structure I, on the other hand, observation 
responsibilities are equally divided—each booth is 
assigned two instruments. Structure II assigns the 
heaviest observation load to the booths which al- 
ready have the most control responsibilities. Struc- 
ture III counterbalances the control responsibilities by 
assigning the heaviest observation load to Booth 1. 
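The load-balancing comparison among the three structures can be illustrated by counting responsibilities per booth. The sketch below weights one switch and one instrument equally, which is an assumption made here for illustration and not a measure used in the report; the instrument counts for Booths 2 and 3 under Structures II and III are read from the row headings of Fig. 2.

```python
# Control switches per booth (constant across structures).
CONTROLS = {"Booth 1": 1, "Booth 2": 2, "Booth 3": 3}

# Instruments observed per booth under each structure.
INSTRUMENTS = {
    "I": {"Booth 1": 2, "Booth 2": 2, "Booth 3": 2},
    "II": {"Booth 1": 1, "Booth 2": 2, "Booth 3": 3},
    "III": {"Booth 1": 3, "Booth 2": 2, "Booth 3": 1},
}


def combined_load(structure):
    """Switches plus instruments per booth.

    ASSUMPTION: a switch and an instrument count as equal units of load.
    """
    return {
        booth: CONTROLS[booth] + INSTRUMENTS[structure][booth]
        for booth in CONTROLS
    }


def load_spread(structure):
    """Difference between the most and least loaded booths."""
    loads = combined_load(structure).values()
    return max(loads) - min(loads)


for name in INSTRUMENTS:
    print(name, combined_load(name), "spread:", load_spread(name))
```

On this crude count Structure III is perfectly balanced and Structure II is the least balanced, which matches the ordering given in the text.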

In terms of the principles discussed above, Struc- 
tures I and II are equivalent in autonomy and both 
are superior to Structure III. Structure III is
superior to Structure I in terms of load balancing and
both are superior to Structure II. While this is not 
an ideal design for the purpose of making compari- 
sons, it permitted as good a test as could be devised 
of the relative force of these principles in view of 
other restrictions (primarily that total information 
transmission should be held constant). 

In addition to the structural condition, a further 
experimental condition was investigated. This was
the rate at which instrument readings were changed.
Under the high-load² condition, two readings were
changed every 10 sec.; under the medium-load con- 
dition, two instruments were changed every 15 sec.; 
under the low-load condition, two instruments were 
changed every 20 sec. A change in two instrument 
readings might necessitate a change in two or three 
control settings. 

A given team performed under all three structure 
conditions at a fixed-load level. Groups were randomly
assigned to one of the 6 permutation orders
of the structure conditions: I II III; I III II; II I
III; II III I; III I II; and III II I. Thus 18 groups
were required for a complete factorial study of the 
6 permutation orders and three loads. 
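
The group count can be checked by enumerating the layout. The following is a minimal sketch, assuming only the three structures and three load intervals named in the text; the variable names are illustrative and not taken from the original study.

```python
from itertools import permutations, product

STRUCTURES = ("I", "II", "III")
LOAD_INTERVALS_SEC = (10, 15, 20)  # high, medium, and low load

# Every order in which one team can meet the three structures.
orders = list(permutations(STRUCTURES))

# One group per (order, load) cell of the factorial layout.
cells = list(product(orders, LOAD_INTERVALS_SEC))

for order in orders:
    print(" ".join(order))
print(len(orders), "orders x", len(LOAD_INTERVALS_SEC), "loads =", len(cells), "groups")
```

Each team contributes one order at one load, so the 18 cells correspond to the 18 groups required.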

Under each structure a fixed set of 18 instrument 
changes (at 10 sec., 15 sec., or 20 sec.) was repeated 
three times. Previous experience with the task had 
indicated that specific sequence learning was most 
unlikely, and this exact repetition permitted tests of 
internal consistency of performance. 


Results 
General Performance 


The primary analysis is concerned with gen- 
eral differences in performance attributable to 
experimental treatments. For this purpose an 
analysis of variance was performed on the en- 
tire set of “error” scores. Error scores repre- 
sent the number of times that a given control 
was not set at the prescribed setting for a 
given information input until the next change 
in information. 

The classification for analysis of variance 
and the “levels” for each are as follows: 

1. Controls. These are the six distinct 
switches (PS, RL, SK, LG, SM, and CS). 

2. Periods. These are the three repetitions
of the input sequence. 

3. Sessions. These are the three experi- 
mental sessions which each group had during 
which they completed one of the 6 permuta- 
tions of the three structure conditions. There 
were three periods in each session. 

4. Structures. These are, of course, the 
three structures discussed above. 

5. Loads. These are the three rates of 
change of input information (10 sec., 15 sec., 
and 20 sec.). 


2 Note that the term “load” is used in this report 
to describe relative individual responsibility as well 
as over-all demands on the group. Where there is 
danger of confusion, “over-all load” will be used for 
the latter aspect. 





Man-Machine System Design 


In addition to these 5 factorial classifica- 
tions of the data there are two replications. 
These arise from the fact that there are two 
permutations of the three structures which 
lead to groups performing in the same struc- 
ture during the same session. For example, 
groups running in the orders I, II, III and III,
II, I both have Structure II in the second session.
(It was assumed that the “carryover”
effects of I on II and III on II could safely
be confounded with error.) The total degrees
of freedom are therefore 6 × 3 × 3 × 3


Table 1


Complete Factorial Analysis of Control Error Scores †


                                   Sums of
Source                    df       Squares         MS

Total                    971       8,298.2        8.55
Controls (a)               5         921.9      184.38
Periods (b)                2         164.2       82.10
Sessions (c)               2         293.4      146.70
Structures (d)             2         328.0      164.00
Loads (e)                  2         664.5      332.25
a × b                     10          70.6        7.06
a × c                     10         220.2       22.02
a × d                     10         342.1       34.21
a × e                     10         185.6       18.56
b × c                      4          63.9       15.98
b × d                      4          16.3        4.08
b × e                      4          17.6        4.40
c × d                      4          59.9       14.96
c × e                      4          41.1       10.28
d × e                      4          11.2        2.80
a × b × c                 20         106.2        5.31
a × b × d                 20          64.2        3.21
a × b × e                 20          69.7        3.49
a × c × d                 20         173.1        8.66
a × c × e                 20         215.7       10.79
a × d × e                 20         134.4        6.72
b × c × d                  8          44.9        5.61
b × c × e                  8          19.5        2.44
b × d × e                  8          47.4        5.93
c × d × e                  8         186.2       23.28
a × b × c × d             40         145.4        3.04
a × b × c × e             40         144.5        3.61
a × b × d × e             40         156.9        3.92
a × c × d × e             40         372.7        9.32
b × c × d × e             16          34.8        2.18
a × b × c × d × e         80         384.7        4.81
Remainder                486       2,597.4        5.34


† F ratios and significance levels are presented in separate
tables.


Table 2


Totals for Structures, Sessions, and Loads Pooled over
Controls, Periods, and Replications


                                Structures
Session      Load          I        II       III

I            10”          150
             15”          103
             20”           45
             Total        298       453       482

II           10”          118       131
             15”           91        87
             20”           66        83
             Total        275       301       443

III          10”          118       127
             15”           57        98
             20”           38        37
             Total        213       262       322

Structure totals          786     1,016     1,247


Source                                          df       MS         F†

c          Sessions                              2     146.70      9.81*
d          Structures                            2     164.00     10.91*
e          Loads                                 2     332.25     32.33**
c × d      Sessions × structures                 4      14.96
c × e      Sessions × loads                      4      10.28
d × e      Structures × loads                    4       2.80
c × d × e  Structures × sessions × loads         8      23.28


† First-order interactions are tested against the second-order
interaction. Main effects are tested against their larger first-
order interaction.

* Significant at the 5% level of confidence.

** Significant at the 1% level of confidence.


× 3 × 2 - 1, or 971. Table 1 presents the
complete analysis of variance.³
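
The degrees-of-freedom bookkeeping can be retraced as follows. This is a minimal sketch that assumes only the factor levels and the two replications stated in the text; the dictionary keys are illustrative labels.

```python
from itertools import combinations
from math import prod

# Levels of the five crossed classifications, as listed in the text.
levels = {"a": 6, "b": 3, "c": 3, "d": 3, "e": 3}
replications = 2

total_df = prod(levels.values()) * replications - 1

# A main effect has (levels - 1) df; an interaction has the product of the
# (levels - 1) terms of the factors it involves.
factorial_df = {}
for order in range(1, len(levels) + 1):
    for combo in combinations(levels, order):
        factorial_df["x".join(combo)] = prod(levels[f] - 1 for f in combo)

remainder_df = total_df - sum(factorial_df.values())
print(total_df, sum(factorial_df.values()), remainder_df)
```

The factorial sources together take 485 degrees of freedom, which leaves the 486 shown for the remainder in Table 1.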

Table 2 shows totals and the analysis of 
variance for experimental effects of primary 
interest, i.e., structures, sessions, and loads. 
The increasing order of difficulty from Struc- 
ture I to Structure III obtains regularly over 
all sessions and loads as confirmed by its F 


3 This complete analysis of variance is presented
for reference, but the discussion is confined to a relatively
few sources of variance which are more clearly
indicated by subtables. In view of the somewhat
confused situation which currently obtains with regard
to pooling or not pooling interaction sums of
squares (2), we shall lean in the “conservative” direction
here and discuss specific F ratios. The obtained
Fs are shown at the bottom of the subtables from
which they were derived.

ratio. This ordering supports the interpretation
that both load balancing and autonomy
are influential but that the latter is more 
heavily weighted in this task. That is, the 
fact that lower error scores are obtained on 
Structure I than Structure II can be attrib- 
uted entirely to load balancing since they are 
equivalent in autonomy. The difference between
Structures I and II and Structure III,
which is in favor of the former, can be attributed
entirely to autonomy since Structure
III is more completely “load balanced” than 
either of them. 

Table 2 also demonstrates the consistency 
of session effects for the three structures, 
shown in terms of steadily decreasing error
totals from Session I to Session III. Similarly,
it may be seen that increased load (as
indicated by decreasing time intervals) leads 
to increased errors for all three sessions. Al- 
though the secondary variables of load and 
prior experience influence error scores in their 
own right, they do not apparently interact 
with the structure variable, as is shown by 
the low first-order interactions. The rather 
higher second-order interaction between struc- 
ture, session, and load probably arises from 
the fact that in the present design this source 
contains a large between-groups variance com- 
ponent. 

Having demonstrated the consistency of 
over-all differences among the three structures 


Table 3


Individual Control Error Totals by Structure and Session


                      Booth 1         Booth 2                    Booth 3
Structure  Session      RL         SK         CS         LG         SM         PS

I          I            10         44         31         55         63         95
           II           22         41         28         78         44         62
           III          14         29         55         35         35         45
           Total        46        114        114        168        142        202
                    (2; 0, 0)  (1; 1, 0)  (2; 0, 0)  (1; 1, 0)  (2; 0, 0)  (1; 1, 0)

II         I            83         31         93         52         79        115
           II           23         42         63         62         45         66
           III          17         15         92         55         29         54
           Total       123         88        248        169        153        235
                    (1; 1, 0)  (2; 0, 0)  (1; 1, 0)  (2; 0, 0)  (2; 0, 0)  (1; 1, 0)

III        I            26         58        104         83        114         97
           II           25         60         90         76        100         92
           III           9         50         78         52         71         62
           Total        60        168        272        211        285        251
                    (2; 0, 0)  (0; 2, 0)  (1; 1, 0)  (0; 2, 0)  (1; 1, 0)  (1; 1, 0)


Source                                            df       MS        F†

a          Controls                                5     184.38     5.39*
c          Sessions                                2     146.70     6.67*
d          Structures                              2     164.00     4.79*
a × c      Controls × sessions                    10      22.02     2.54*
a × d      Controls × structures                  10      34.21     3.95**
c × d      Sessions × structures                   4      14.96     1.73
a × c × d  Controls × sessions × structures       20       8.66


† First-order interactions are tested against the second-order interaction.
Main effects are tested against their larger first-order interaction.
* Significant at the 5% level of confidence.
** Significant at the 1% level of confidence.


over the range of conditions studied, it is of
interest to examine error totals associated with 
specific controls. Table 3 shows individual 
control errors by structure and sessions, to- 
taled over six groups and three periods of 
performance. Column subtotals indicate error 
totals associated with specific controls in a 
particular structure and as the accompanying 
analysis of variance shows, differences among 
these subtotals are highly significant as compared
with their interaction with sessions (i.e.,
source a × d divided by a × c × d). It will
be noted that these differences are in addition 
to first-order control effects which might be 
attributed to human engineering factors such 
as ease of positioning control switches, likeli- 
hood of errors in reading instruction cards, 
etc. 
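
The testing rule used in the subtables can be illustrated with the mean squares printed under Table 3. This is a minimal sketch; the function names are hypothetical, and the rule is the one stated in the table footnote (first-order interactions against the second-order interaction, main effects against their larger first-order interaction).

```python
# Mean squares as printed in the analysis of variance under Table 3.
ms = {
    "a": 184.38, "c": 146.70, "d": 164.00,
    "axc": 22.02, "axd": 34.21, "cxd": 14.96,
    "axcxd": 8.66,
}

def f_first_order(term):
    """First-order interaction tested against the second-order interaction."""
    return ms[term] / ms["axcxd"]

def f_main(effect):
    """Main effect tested against the larger first-order interaction containing it."""
    candidates = [
        value for term, value in ms.items()
        if term.count("x") == 1 and effect in term.split("x")
    ]
    return ms[effect] / max(candidates)

for term in ("axc", "axd", "cxd"):
    print(term, round(f_first_order(term), 2))
for effect in ("a", "c", "d"):
    print(effect, round(f_main(effect), 2))
```

The ratios agree with the tabled F values to within rounding of the printed mean squares.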

Following an analysis suggested in an ear- 
lier report (6) we can identify three types of 
linkages here in terms of the source of in- 
formation for specific controls (cf. Fig. 2). 
These are the (2; 0, 0) linkage types for 
which the agent has directly two items of in- 
formation; the (0; 2, 0) linkage types for 
which two items must be obtained from a 
single other source; and the (1; 1, 0) linkage 
types for which a single item must be ob- 
tained from an outside source. These are in- 
dicated in Table 3 below the subtotals, and 
inspection suggests that there are real differ- 
ences associated with these three types—even 
where the same operators are concerned. 

Our general concern with structure then 
leads us to postulate three classes of effects 
accounting for the variance in the 18 column 
subtotals of Table 3: 

1. Effects arising from the amount of 
“work” falling on an individual operator by 
virtue of his assignment to one, two, or three 
controls. This dimension corresponds exactly 
to booth numbers. 

2. Effects associated with the three “linkage
types” (2; 0, 0), (0; 2, 0), and (1; 1, 0).

3. Structural “context” effects arising from 
over-all differences in interference and dis- 
tracting demands, presumably increasing from 
Structure I to Structure III.

Since these effects are not orthogonally ar- 
ranged in the present design—e.g., (0; 2, 0) 
linkage type occurs only in Structure III—it
is necessary to estimate their magnitude by a 


Table 4


Comparison of Control Error Values Under Various
Conditions with Estimates Based on
Additive Assumptions


                  Obtained      Estimated      Error of
Conditions         Errors         Errors       Estimate

a1b1c1               46           14.41         -31.59
a2b3c1              114          164.87          50.87
a2b1c1              114           92.32         -21.68
a3b3c1              168          195.66          27.66
a3b1c1              142          123.11         -18.89
a3b3c1              202          195.66          -6.34
a1b3c2              123          125.30           2.30
a2b1c2               88          130.66          42.66
a2b3c2              248          203.21         -44.79
a3b1c2              169          161.45          -7.55
a3b1c2              153          161.45           8.45
a3b3c2              235          234.00          -1.00
a1b1c3               60           88.82          28.82
a2b2c3              168          173.99           5.99
a2b3c3              272          239.28         -32.72
a3b2c3              211          204.78          -6.22
a3b3c3              285          270.07         -14.93
a3b3c3              251          270.07          19.07


Source                          df        MS         F

Individual control totals       17      93.605
Fitted constants                 6     229.00     11.59**
Residual                        11      19.76


** Significant at the 5% level of confidence.

special technique (cf. [1] pp. 278-284). 
This requires the solution of least-squares 
estimating equations for three sets of three 
constants which represent the “levels” in 1, 2, 
and 3 above. It is assumed that these classes 
of effects do not interact although it might be 
noted that “context” effects, Class 3, may be 
construed in part as interactions between 
other factors. 

Table 4 presents the various combinations 
of effects and associated subtotals taken from 
Table 3. The third column of this table 
shows the control errors that would be predicted
from the fitted values of the constants ⁴


4 The values of the fitted constants were:


a1 = -80.32     b1 = -37.10     c1 = -37.56
a2 =  -2.41     b2 = -29.84     c2 =    .78
a3 =  28.38     b3 =  35.45     c3 =  36.85


where subscripts refer respectively to the three levels
of each of the effects suggested in the text. This entire
procedure, however, should be considered as

and the fourth column shows the resulting
errors of estimation. As the analysis of vari- 
ance at the bottom of Table 4 shows, these 
three sets of estimated constants account for 
the bulk of the variance. It may be con- 
cluded from this that control errors are largely 
due to additive effects of individual load, link- 
age type, and total structural difficulty. 
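
The additive assumption can be made concrete with a short calculation. The sketch below is illustrative only: the signs of the constants are taken as those implied by the tabled estimates, and the general mean of 169.39 is an assumption obtained by back-solving from the first estimate in Table 4, since the source does not print it.

```python
# Fitted constants from the footnote (signs assumed from the tabled estimates).
A = {1: -80.32, 2: -2.41, 3: 28.38}    # individual load: Booth 1, 2, 3
B = {1: -37.10, 2: -29.84, 3: 35.45}   # linkage type: (2;0,0), (0;2,0), (1;1,0)
C = {1: -37.56, 2: 0.78, 3: 36.85}     # structural context: Structure I, II, III
GENERAL_MEAN = 169.39                  # assumed; not printed in the source

def estimate(i, j, k):
    """Additive estimate of a control error subtotal: mean + a_i + b_j + c_k."""
    return GENERAL_MEAN + A[i] + B[j] + C[k]

# A few obtained subtotals from Table 3, keyed by (a, b, c) level.
obtained = {(1, 1, 1): 46, (1, 3, 2): 123, (3, 3, 3): 285}

for key, errors in obtained.items():
    est = estimate(*key)
    print(key, errors, round(est, 2), round(est - errors, 2))
```

Under these assumptions the error of estimate is the estimated value minus the obtained value, which matches the sign of the first entry in Table 4.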

A further attempt was made to relate indi- 
vidual control errors more directly to the gross 
load on the operator, including observation 
and communication demands as well as con- 
trol responsibilities. It was considered that 
such a finding would clearly demonstrate the 
effects due to uneven load balancing. How- 
ever, it does not appear possible to treat these 
data in this way. Other factors affecting con- 
trol errors are evidently distributed in a 
rather undifferentiated fashion over all three 
control operators within a given structure 
condition. The inference is that an over- 
loaded individual is as likely to neglect obli- 
gations to other group members, thereby in- 
creasing their errors, as he is to neglect his 
own control responsibilities. 


Group and Replication Effects 


Several questions of primarily methodo- 
logical interest have not been treated above. 
The first question is whether group perform- 
ance is consistent within a given structure 
condition, and the second concerns the merits 
of the particular method of replication em- 
ployed here. A single analysis permits in- 
vestigation of both questions. 

Within a given load treatment 6 groups 
were run, representing the 6 permutation orders
of Structures I, II, III. If we classify
three of these permutations, namely I, II, III;
II, III, I; and III, I, II, as “clockwise,” and
the other three, I, III, II; II, I, III; and III,
II, I, as “counterclockwise,” it will be observed
that each of the clockwise permutations
coincides with each counterclockwise permutation
in exactly one place. For example,
the clockwise permutation II, III, I coincides
with the counterclockwise permutation I, III,
II in the second place. Thus we have a set
of three squared or nine direct comparisons 


illustrating one interpretation of the data rather than 
as a precise determination of the magnitude of effects.


Table 5


Analysis of Variance of “Clockwise” (CW) and
“Counterclockwise” (CCW) Sequences


Source                             df        MS         F

1) CW-CCW                           1    1052.64      6.70*
2) Within CW                        2     109.10
3) Within CCW                       2      21.34
4) Within CW × within CCW           4      76.36
5) Load × 1                         2      38.03
6) Load × 2                         4     282.41
7) Load × 3                         4     211.70
8) Load × 4                         8     104.56

Total (groups)                     27     166.96
Residual                           54      59.15


* Significant at the 5% level of confidence.
** Significant at the 1% level of confidence.

between groups for each of the three loads. 
For each of these comparisons there are three 
subscores based on the three periods. Table 5 
presents the analysis of variance of these dif- 
ference scores. 
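
The pairing of clockwise and counterclockwise orders can be verified by enumeration. This is a minimal sketch; the helper name `shared_sessions` is hypothetical.

```python
from itertools import product

clockwise = [("I", "II", "III"), ("II", "III", "I"), ("III", "I", "II")]
counterclockwise = [("I", "III", "II"), ("II", "I", "III"), ("III", "II", "I")]

def shared_sessions(p, q):
    """Session numbers (1-based) in which two orders present the same structure."""
    return [s + 1 for s, (x, y) in enumerate(zip(p, q)) if x == y]

pairs = {
    (cw, ccw): shared_sessions(cw, ccw)
    for cw, ccw in product(clockwise, counterclockwise)
}

for (cw, ccw), sessions in pairs.items():
    print(cw, ccw, "coincide in session", sessions)
```

Every one of the nine pairs shares exactly one session, which is what yields nine direct comparisons at each load.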

The first result to be noted is that group 
performance is consistent in the sense that 
differences between group error totals are sig- 
nificant as compared with their interactions 
with periods. None of the interaction effects 
relating to the two sets of permutation orders 
and loads are significant, but the over-all dif- 
ference between clockwise and counterclock- 
wise performance is significant at the 5% 
level as compared with pooled interactions 
(with 1 and 22 df). The difference is in fa- 
vor of the counterclockwise order. Since fur- 
ther inspection of the data fails to suggest 
any rational explanation for this difference, it 
seems justified in this case to write it off as 
sampling fluctuation. Nevertheless, it indi- 
cates that this sort of design, while highly 
economical in terms of subject usage and ad- 
ministrative burden, may lead to unduly large 
error variance estimates due to “carryover” 
effects between experimental structure condi- 
tions. 


Communication 


The communication frequency data were 
analyzed for effects due to structure, load, and 
experience, as well as possible relationships
with performance scores. Since most of the
findings merely duplicate previously reported 
results (4, 5), they will be summarized only 
briefly. 

First, the three different structures elicit 
significantly different patterns of communica- 
tion (F = 13.40 with 16 and 16 df). Inspection
of the data supports the generalization
that frequency of communication is related to 
relative informational requirements, but not 
in any simple fashion. 

Second, the over-all number of communications
decreases with increasing load (F =
18.76 with 2 and 16 df) and tends to be con- 
stant for the actual time available (i.e., 10 
sec., 15 sec., or 20 sec., respectively, for the 
three loads). The frequency of communica- 
tions increases with task experience (F = 
10.63 with 2 and 4 df) and this increase 
seems to occur equally in all structures (i.e., 
the sessions by structure interaction is negligible).
Finally, there is no evidence of any
relationship between frequencies or patterns 
of communication and performance differences 
between comparable groups. 


Discussion 


The present experiment must clearly be 
considered as illustrative of certain factors in 
system design rather than as a definitive treat- 
ment of the problem. While the results ob- 
tained substantiate the general fruitfulness of 
this area of investigation, there are several 
major uncertainties that inhibit any extensive 
generalization. First, only a very narrow 
range of structures has been explored in this 
and related studies, and other “principles” of 
equal or greater importance may yet be encountered.⁵
Such principles may in fact point
to a way of escape from the dilemma posed 
by the conflict between load balancing and 
autonomy shown in the present study. It 
has been suggested that one way to overcome 
this limitation in the generalizability of re- 
sults might be to use a random sampling of 


5 As an example, an earlier study (3) attempted to 
determine the influence of homogeneity (in func- 
tional terms) of individual assignment as opposed to 
autonomy. While the results were somewhat incon- 
clusive in this regard, it still seems that homogeneity 
of individual assignment is a valid and important 
principle.

structures as defined by observation and control
assignments and to order them on the
basis of an ex post facto analysis of perform- 
ance. A second limitation of the present 
study is the difficulty in equating “work units” 
of observation and control operation. It 
might turn out, for example, that if the ob- 
servations required more active reconnais- 
sance activities on the part of group mem- 
bers, load balancing would become more 
critical for performance. 

On the other hand, this experiment does 
demonstrate that, for this task, both load 
balancing and autonomy are effective princi- 
ples. The superiority of a structure in which 
both autonomy and load balancing are con- 
sidered as opposed to structures in which one 
or the other is slighted amply bears out this 
point. It has also been demonstrated that 
autonomy is much more critical than load 
balancing over the range investigated in this 
study. 

The more analytic approach to these re- 
sults in terms of single control errors further 
helps to understand these effects. As in previ- 
ous studies (6) we find that not only is the 
amount of relayed information in the non- 
autonomous conditions critical but also the 
way in which this information is distributed. 
As suggested previously (5), this indicates 
that the major problem faced by groups is 
not simply one of transmitting a large volume 
of information but of phasing messages so 
that transmitted information reaches its desti- 
nation at the time it is needed. Although 
there is no direct proof in the present data, 
the performance differences attributable to 
autonomy may be related to the necessity for 
overloaded operators to neglect certain as- 
pects of their jobs—especially under high 
over-all load. 

Analysis of the communication records pro- 
vides additional evidence for statements made 
above and in previous reports. Under the 
present learning conditions, at least, groups 
do not fully adapt to increased load on the 
individual or on the entire group. The bur- 
den of initiating communications is placed 
on the user of information rather than the 
immediate source, and, as a consequence, 
much of the relevant information is completely
lost. Thus, it might be said that
much of the potential for “load balancing” 
provided by the formal or assigned structures 
used here is never realized by the groups. 


Received August 24, 1956.


Some Biographical Determiners of Participation in Group
Discussion ¹


Group discussion is currently being utilized
not only as a research medium, but also as 
a vehicle for communicating information and 
for organizing community efforts toward com- 
mon goals. Problems of communication in 
such groups can better be understood with in- 
creased information about the composition of 
such groups and the impact of such composi- 
tion upon the discussion process. A number 
of questions in this area remain unanswered. 
Does information aimed at a certain class 
of recipients, for example, actually stimulate 
these individuals in some tangible fashion? 
Do various community service information 
programs elicit the active participation of 
those community members who are the principal
targets of this type of educational effort?
The community discussion group provides
a naturalistic setting in which the effects
of a communication, such as a motion picture 
film, may be assessed. Knowledge of the fac- 
tors determining participation in such group 
discussion should prove of value in approach- 
ing the kinds of questions posed above. 

It is apparent to anyone who has observed 
the behavior of community discussion groups 
that extent of participation is not normally 
distributed among the members. Indices of 
participation have been developed and de- 
scribed by several investigators (1, 4, 5). 
However, the distribution of participation in 
groups of varying sizes has not received much 
attention in the literature, perhaps due to the 
difficulty of obtaining accurate counts of the 
number of words contributed by the members 
of large audiences. From the recorded dis- 
cussions of groups having as few as 16 and 
as many as 85 members, we have determined 
the curves of verbal output as well as other 
quantitative measures describing the discus- 


1This research was supported by Special Grant 
3M9064 from the National Institute of Mental 
Health, United States Public Health Service. 

2Now with Psychological Research Associates, 
Alexandria, Va. 


sion process (3). Figure 1 shows the dis- 
tribution of verbal output in groups classified 
as “small” (Ns of 16 to 18) and “large” (Ns 
of 39 to 85). A J-shaped function describes 
the distribution of participation in the large 
groups, whereas a highly skewed curve is gen- 
erated with data from the small groups. A 
relatively few individuals, in short, tend to 
dominate the discussions in groups of any size, 
with the proportion of such active participants 
decreasing in larger groups. It should be

Fig. 1. Distributions of verbal output in large and small discussion groups during approximately half-hour sessions.


noted, of course, that the exaggerated J-shape 
in the case of the large groups is partly at- 
tributable to the fact that only a limited num- 
ber of individuals in such groups can partici- 
pate in discussion during a half-hour period. 

Our concern in this investigation was with 
identifying certain distinguishing character- 
istics of those individuals who participate 
voluntarily in a group discussion on a topic 
of common interest. Two general hypotheses
were tested: (a) Participants and nonparticipants
in the community group discussion
situation may be differentiated in terms of
certain biographical variables. (b) These
biographical predictors will operate independ- 
ently of group size or type of film presented 
as a topic for discussion. 


Procedure 


A questionnaire was developed containing bio- 
graphical items that were judged as relevant to the 
problem under investigation. They appeared logi- 
cally to be classifiable into four general categories 
as follows: 


A. General 
1. age 
2. sex 
3. marital status 
4. number of children 
B. Socioeconomic 
1. education 
2. income 
3. home ownership 
C. Familiarity with the discussion area 
1. self-rating of familiarity with the topic 
2. number of relevant communications previ- 
ously experienced 
3. previous attendance at similar discussions 
D. Group affiliation
1. experience as group leader 
2. frequency of attendance at group meetings 
3. extent of acquaintance with other group 
members 
4. number of group memberships 
5. reasons for attending group meetings. 


These fifteen predictors are certainly not inclusive 
of all of the factors that might determine discussion 
participation in community groups. Selection of the 
variables to be measured, however, was made upon 
the basis of their probable relevance to the particu- 
lar type of discussion studied as well as to the ease 
and accuracy with which they could be determined. 

Seven community groups of various sizes were con- 
tacted in the suburban Washington, D. C., area to 
cooperate in the research program. Four of these 
were PTA groups ranging in size from 39 to 85 mem- 
bers, and three were child study groups having from 
16 to 18 members. 

The group members, who met in their accustomed 
places, were first given a 4-page questionnaire con- 
taining items that covered the factors listed above. 
They were assured that the information would re- 
main confidential and were permitted to answer the 
questions anonymously. Following the showing of a
mental health film dealing with child-family rela- 
tionships, the group engaged in approximately a half- 
hour of discussion under professional nondirective 
leadership.³ The discussions were tape recorded, and

3 Dr. C. N. Cofer was kind enough to serve as
discussion leader for all of the sessions. The following
films were shown as subjects for discussion:

a seriatim record was kept of the location of all discussion
participants. Since the group members left
their questionnaires upon their seats, it was rela- 
tively simple to identify later each discussant with 
his recorded comments, even though his name was 
not used. A complete description of the methodology
employed is given elsewhere by McGinnies
(2). Biographical information obtained as well as 
counts of total verbal output for each group mem- 
ber were later coded and transferred to keysort cards 
for tabulation. 

The keysort cards, representing all of the group 
members, were divided into participants and non- 
participants according to whether or not a given in- 
dividual had voluntarily entered a discussion. These 
two groups were then distributed among the response 
categories for each of the fifteen descriptive variables 
and tested for independence by means of chi square. 
The 15 variables yielded two types of contingency 
tables, 2 X 2 and 2 X k, which were evaluated by the 
appropriate formulas. The total number of cases 
available for testing against each variable differed 
slightly, as varying numbers of the 324 individuals 
failed to respond to any given question. This num- 
ber, however, was small so that in general the par- 
ticipant group was represented by about 100 cases, 
while the nonparticipant group averaged 200 in each 
of the comparisons. 
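
The test of independence described above can be sketched as follows. The counts are hypothetical and are NOT data from the study; the function computes the uncorrected Pearson chi square, which may differ from the exact formulas the authors applied (for example, a continuity correction for the 2 × 2 tables).

```python
def chi_square(table):
    """Pearson chi-square statistic and df for an r x k table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df

# Hypothetical counts for illustration only.
# Rows: participants, nonparticipants; columns: two response categories.
counts = [[60, 40], [80, 120]]
stat, df = chi_square(counts)

# Standard .05 critical values of chi square for small df.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}
print(round(stat, 3), df, stat > CRITICAL_05[df])
```

The same function handles the 2 × k tables, since the degrees of freedom are computed from the table shape.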

Since the seven groups varied in size and had in 
some cases seen different films, these two conditions 
may have systematically predisposed individuals of 
different background characteristics to participate in 
discussion. To determine whether size of the discus- 
sion group interacted with the biographical variables 
in determining who participated, we classified the 
discussants according to two categories of group size 
and tested the homogeneity of their background 
characteristics with chi square. We also examined 
the possibility that the biographical descriptions of 
the discussants might vary as different films were 
employed as vehicles for discussion. Since three of 
the groups had discussed the same film, Angry Boy, 
while the other four groups had each discussed dif- 
ferent films, the background descriptions of these 
two categories of discussants could be compared. 
Accordingly, discussants were classified as having 
seen either Angry Boy or one of the other films, and 
the two groups were tested for homogeneity with re- 
spect to each of the fifteen biographical variables. 


Results 


The data were analyzed in three stages. 
First, the 15 selected personal history vari- 
ables were tested for significance in distin- 
guishing discussion participants from nonpar- 
ticipants in the combined groups. Second, the 
discussants only were tested for biographical 


Angry Boy, Your Children and You, Farewell to 
Childhood, Meeting Emotional Needs in Childhood, 
and Why Won’t Tommy Eat?

homogeneity with regard to the 15 variables
when differentiated according to whether they 
were members of large or small groups. Third, 
the discussants were examined for homoge- 
neity of background factors when categorized 
as having viewed Angry Boy or some other 
film. In all of the chi-square tests the .05 
probability level was accepted as a basis for 
rejecting the null hypothesis. The biographi- 
cal factors will be considered within each of 
the four categories into which they seemed 
logically to fit. 

General. This group of variables included 
the age, sex, marital status, and number of 
children of the group members. None of 
these factors was significantly related to par- 
ticipation or nonparticipation in group dis- 
cussion. One reason for the lack of relation- 
ship between any of these characteristics and 
discussion participation may be the fact that 
membership in community groups of the types 
observed acts as a selective factor limiting the 
range of these variables. For example, 60% 
of all of our subjects fell in the 30-39 age 
category, 80% were females, 96% were mar- 
ried and living with their spouses, and 98% 
had one or more children. For other types 
of groups these factors may prove significant 
in predicting discussion participation. In the 
typical PTA and child-study group, however, 
they seem to be so restricted in range as to 
be of little discriminative value. 

Socioeconomic. The variables classified un- 
der this heading included education, income, 
and home ownership. Both education and in- 
come proved to be significantly related to 
participation in discussion. The discussants 
were, in general, better educated and en- 
joyed higher incomes than the nondiscussants. 
Rental or ownership of home failed to differ- 
entiate discussants from nondiscussants, al- 
though a greater percentage of the discussants 
were home owners. Since 87% of the total 
sample were home owners, the statistical test 
was probably not sensitive enough to detect 
the relatively small difference that existed. 

Familiarity with the discussion area. Un- 
der this heading were subsumed those vari- 
ables which indicated the subjects’ experience 
with group discussions, mental health films, 
and mental health information in general. 


Each of the group members had been asked
to check the position on a 5-point rating scale
which best represented his familiarity and
amount of information with respect to mental
health problems. Since most of our respondents
modestly refrained from checking
Category 5, which was labeled “a great deal
of information and familiarity with mental
health problems,” this position was combined
with Category 4 in order to provide sufficient
cases for the analysis. The results indicated
that individuals who rated themselves as relatively
familiar with the subject matter of
mental health were more likely than less confident
persons to participate in an ensuing
discussion. Similarly, those members who
reported having seen a greater number of
mental health films in previous club experience
were significantly more likely to be active
in the discussion. It apparently made
no difference, however, whether films seen in 
the past had served as topics for discussion. 
Those respondents who reported prior experi- 
ence with group discussion of mental health 
films were no more prone to participate in the 
current discussion than those who did not re- 
port such experience. Whether or not the re- 
spondent had participated in such previous 
discussions, however, was unknown. It should 
probably also be noted that many of the 
respondents reported as mental health films 
various screened and televised productions 
which were viewed under informal conditions 
not followed by organized discussion. 

Group affiliation. The questions related to 
this area probed the status of the individual 
in the group, his interest in the group, his de- 
gree of acquaintanceship with the other group 
members, his past experience with commu- 
nity groups, and his reasons for attending the 
group’s meetings. Four of these five factors 
emerged as significantly related to partici- 
pation in discussion. Their predictive value 
may be described as follows: (a) Those in- 
dividuals who were currently officers in the 
group, or who had been officers in other 
groups, were more inclined to participate. 
(b) Those members who had attended a 
greater number of previous meetings were 
more likely to be among the discussants than 
those individuals who had attended fewer
previous meetings. (c) The larger the extent
of an individual’s personal acquaintanceship
with other group members the greater
was the probability that he would participate 
in discussion. (d) Those persons who were 
members of several community groups were 
more likely to participate than individuals 
who were members of few or no groups. The 
fifth factor—reasons for attendance—failed to 
yield a significant relationship with participation.

Eight of the fifteen biographical factors
emerged from the analysis as reliable pre- 
dictors of discussion participation. Of the 
eight chi-square values obtained, seven were 
significant at better than the .01 level. The 
percentages of discussants and nondiscussants 
according to the breakdowns within the dis- 
criminating categories are presented in Fig. 2. 
It may be noted that with few exceptions
there is a step-wise increase in percentage of
participants with increments in the particular
variable under consideration.

Fig. 2. Percentage distributions of participants
and nonparticipants in discussions of mental health
films according to eight significant predictor variables.


Group Size as an Interaction Factor 


Although a number of biographical deter- 
miners of discussion participation proved to 
be statistically reliable when tested over all of 
the groups, a possibility remained that some 
of these might operate differently in groups 
of widely discrepant sizes. The groups em- 
ployed in the study fell rather obviously into 
two classes: the large groups with member- 
ships of 39 to 85 and the small groups having 
from 16 to 18 members. The discussants, 
therefore, were classified as belonging to either 
the large or the small group categories and 
were tested for homogeneity within the re- 
sponse categories of the 15 biographical vari- 
ables. The results of this analysis can be 
stated in summary form, with mention being 
made only of the significant differences. 

While the majority of the participants in 
both large and small groups were female, sig- 
nificantly fewer males participated in the 
small group discussions. This result was 
clearly a function of sex proportion differ- 
ences in the general composition of the large 
and small groups. Two of the three small 
groups, for example, were composed entirely 
of women. There was also a tendency for 
more individuals in the lower income cate- 
gories to participate in the large group dis- 
cussions, and this result, too, was attributable 
to a general economic difference between the 
large and small groups. Members of the large 
groups represented a more modest level of in- 
come. When distributed along a continuum 
of number of mental health films previously 
viewed, the discussants in the large groups 
tended to fall more frequently at the lower 
extreme. This difference probably reflects 
the fact that large PTA groups simply do not 
program films of this type as frequently as 
small study groups, and hence the participant 
members have lacked opportunity for this 
type of experience. 

Finally, the discussants in the large and 
small groups were homogeneous with respect 
to all of the group affiliation factors except 
extent of group acquaintanceship. Those discussants
who were members of the large groups
were personally acquainted with a smaller percentage
of the total membership than were the
participants in the small groups. Consequently,
extent of personal familiarity in the
group did not relate systematically to participation
in the large groups. The generality
of our earlier conclusions with respect
to the combined large and small groups,
therefore, requires modification only to this
extent, namely, that the number of films
previously seen as well as extent of personal
acquaintanceship in the group are reliable
predictors of discussion participation only in
small groups.


Film Differences 


The possibility also exists that the biographical
determiners of discussion thus far
described may not operate in the same direction
with different films. Since our initial
analysis treated all the discussions as though
they were oriented about a single film, those
biographical factors which might have interacted
with a specific film were not isolated.
In order, however, to determine partially
whether the final set of predictors was invariant
with different film content, we grouped
the discussants according to whether they had
used Angry Boy or one of the other four films
as their discussion topic. Two small groups
and one large group had seen this film in
common, while the remaining groups had
viewed different films on the occasion of their
first discussion meeting.

Of the fifteen chi-square tests, only two
were significant. The film Angry Boy apparently
stimulated proportionally more college-level
persons and home owners to enter
the discussion than did the other films. Without
further evidence, however, it would be unwise
to conclude that these factors invariably
interact with film content in determining who
will participate in a discussion. The more
important finding is that all of the familiarity
and group affiliation variables previously
identified as related to discussion participation
remained independent of film differences.
Furthermore, since about one of the fifteen
chi-square tests would be expected to exceed
the .05 level by chance alone, we may conclude
with reasonably high assurance that the
predictors are relatively independent of
specific film content.

Relationships Among the Predictors 


Since the biographical variables which 
emerged as significantly related to participa- 
tion in discussion are in no sense “pure” fac- 
tors, a question arises as to the extent of the 
intercorrelations among them. A non-parametric
analysis was indicated in view of our
lack of knowledge about the distribution of 
many of the predictors in the general popula- 
tion. Accordingly, all possible combinations 
of the eight reliable predictors were examined 
for independence by chi-square. The entire 
sample of over 300 Ss was broken down in 
terms of the response categories for each of 
the eight biographical items. Twenty contingency
tables were thus generated and tested
for significance with the appropriate degrees
of freedom. The results are shown in Table 1.

Table 1

Relationships Among the Predictor Variables

[Matrix of the eight predictors: Education, Income,
Information, Films viewed, Club official,
Meetings att’d, Acquaintances, Memberships.
Cell entries are not recoverable from this copy.]

Note. Since the obtained chi squares are based
upon different degrees of freedom and therefore
not directly comparable, the probability levels
of the significantly related factors are entered
in the appropriate cells. All other comparisons
failed to reach significance.

Eight of the comparisons, as indicated in 
the table, were significant at better than the 
.01 level of confidence. These correspond
rather strikingly with the logical groupings of 
the biographical items, although several addi- 
tional relationships are revealed. Educational 
attainment is related not only to high income 
level but also to self-rating of knowledge 
about mental health problems and member- 
ship in various organizations. Self-rating of 
information is related to number of mental 
health films seen. Status as a club official 
tends to be associated with number of meet- 
ings attended, extent of acquaintanceship in 
the group, and number of groups with which 
affiliated. Attendance at meetings, finally, is 
positively related to extent of group acquaint- 
anceship. 

If one examines the groupings in Table 1, 
it becomes apparent that two of the predictors 
show positive relationships, either directly or 
indirectly, with the remaining six. It follows 
that predicting participation in group discus- 
sion on the basis of a minimum of information 
about the discussion group members could 
best be accomplished using the factors of 
education and leadership. These are defined, 
specifically, as grade attainment in school and 
elected officer status in the group. To assess 
the actual effectiveness of employing only 
these two variables as predictors of discus- 
sion participation, we undertook an additional 
breakdown of the sample. Employing as cri- 
teria completion of high school, or better, and 
present or past status as a club officer, we 
pulled from the entire file of Ss those who 
met both requirements. Since these two pre- 
dictors account for only a portion of the vari- 
ance associated with discussion participation, 
it was expected that two types of errors would 
result, i.e., those involving the exclusion of 
individuals who had actually been partici- 
pants, and those involving inclusion of some 
who had been nonparticipants. Using the 
criteria of education and leadership as pre- 
dictors of the discussion participants, we drew 
67 of the 104 actual participants and rejected 
156 of the 220 nonparticipants. In short, 
these two predictors alone were 64% accurate 
in selecting participants and 71% accurate in 
rejecting nonparticipants. Although far from 
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Table 2 


Contingency Table Showing the Success of Differenti 
ating Participants (P) and Nonparticipants (NP) 
in Group Discussion Using Criteria of 
Education and Leadership 


(Theoretical frequencies are in parent heses) 


Predicted 


P NP Total 


Actual P 67 (42.05) 
NP 6A (88.95) 


37 (61.95) 104 
156 (131.05) 220 


Total 131 193 324 


Note.--x* = 36,60; p < .001, 


perfect, the results nevertheless exceed chance 
expectancy at better than the .001 level of 
confidence. A summary of the results of this 
breakdown is presented in Table 2. 
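The chi-square for the education-and-leadership breakdown can be checked from the counts given in the text: 67 of 104 participants selected, and 156 of 220 nonparticipants rejected (so 37 and 64 misclassified). The following is a minimal sketch of the standard Pearson chi-square without continuity correction; the variable names are illustrative, and whether the authors applied any correction is not stated.

```python
# Observed counts: rows are actual status (P, NP),
# columns are predicted status (P, NP).
observed = [
    [67, 37],
    [64, 156],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Theoretical (expected) frequency for each cell under independence.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square: sum of (O - E)^2 / E over the four cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Proportion of actual participants selected, and of nonparticipants rejected.
hit_rate = observed[0][0] / row_totals[0]
correct_rejection_rate = observed[1][1] / row_totals[1]

print(f"chi-square = {chi_square:.2f}")
print(f"participants selected: {hit_rate:.0%}")
print(f"nonparticipants rejected: {correct_rejection_rate:.0%}")
```

With one degree of freedom, a value of this size lies well beyond the .001 critical value of 10.83.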


Summary and Conclusions 


The objective of this study was to identify 
a set of biographical factors which would dif- 
ferentiate participants from nonparticipants 
in a community group discussion situation. 
Seven community groups were invited to co- 
operate in the project. Both biographical 
and discussion data were collected from the 
324 group members. In testing the discus- 
sants and nondiscussants against each of the 
predictors in contingency tables, eight of the 
fifteen obtained chi-squares were found to be 
significant at or beyond the .05 probability 
level. 

The reliable predictors, grouped into three 
general categories, were: socioeconomic status, 
i.e., education, income; familiarity with the 
discussion area, i.e., self-rating on extent of 
information about mental health problems and 
concepts, number of mental health films previ- 
ously viewed; and group affiliation, i.e., status 
as a club official, attendance record at meetings,
extent of acquaintanceship in the group,
number of group memberships. In each in- 
stance, a high or positive rating was associ- 
ated with a tendency to participate in group 
discussion. 

Interrelationships among the eight predic- 
tors were determined by further breakdown of 
the sample. Of the twenty possible comparisons,
eight revealed significant contingencies
between the variables concerned. The two
factors having the greatest number of signifi- 
cant associations with the remaining six were 
education and leadership. The predictive 
power of these two factors alone was tested 
by pulling from the entire sample those indi- 
viduals who met both criteria. While con- 
taining some error, the results of this pro- 
cedure far exceeded chance probability in 
distinguishing between the actual participants 
and nonparticipants. 

It is evident that the J-shaped distribution 
of verbal output that appears in the group 
discussion situation is attributable in part to 
the selective operation of such biographical 
variables as have been described in this 
study. Of course, such factors as socioeco- 
nomic level, familiarity with the discussion 
area, and group affiliation may reflect other 
determinants which are more basic in dispos- 
ing an individual to engage voluntarily in 
this type of social interaction. Quite con- 
ceivably such factors as general intelligence 
and/or social dominance are associated with 
the three general categories of predictors that 
we have described. However, these are some- 
what more difficult to measure in a natural- 
istic setting, whereas estimates of the eight 
variables described are readily obtained. 

Further research in this area should prob- 
ably determine whether or not self-ratings of 
familiarity with a content area reflect real
knowledge or are simple indications of a generalized
self-confidence. The importance of
the affiliation factor in determining discus- 
sion participation leads to a host of inter- 
esting questions concerning the relationship 
between sociometric patterning and communi- 
cation channels in discussion groups of vari- 
ous types. Finally, those interested in action 
research may find use for the present findings, 
which describe some distinguishing characteristics
of persons who voluntarily enter a free
discussion under nondirective leadership. 


Received August 31, 1956.


Scale Analysis of a Fatigue Checklist


Richard G. Pearson 1
School of Aviation Medicine, USAF 


The use of checklists for measuring affec- 
tive responses (e.g., fatigue, boredom, work 
attitude) has always been open to question 
because there has been no adequate method 
for determining their content. With the de- 
velopment of a technique of scale analysis by 
Guttman (10) the possibility of establishing a 
subjective measure of fatigue seemed promis- 
ing. However, an early criticism leveled 
against scale analysis was that the initial se- 
lection of items was left to the intuition of 
the investigator. To avoid this criticism Ed- 
wards and Kilpatrick (2) suggested that scale 
analysis be preceded by Thurstone scaling and 
item analysis in order to obtain a set of items 
which would have greater assurance of meet- 
ing the requirements of scale analysis. The 
study herein reported describes the applica- 
tion of the Edwards-Kilpatrick method to the 
development of a fatigue scale. 


Development of the Experimental Checklist 


Item selection. Fifteen Air Force enlisted 
personnel working in the department labora- 
tories were asked to list words and phrases 
which might describe a fatigue continuum. 
The continuum, at this stage, was roughly 
defined as one extending from extreme tired- 
ness on one end to extreme well-being on the 
other. The author also searched dictionaries 
and thesauri for appropriate items. Altogether
approximately 500 items were collected.
These were then screened by the
above individuals against three criteria: (a)
would the item enjoy wide understanding, (b)
was the item of a suitable vocabulary level,
and (c) did the item fit the defined continuum
or could it connote other affective states such
as anxiety, boredom, motivation, or morale?

One hundred and fifty items survived this
initial screening. Four psychologists of the
department discussed each of these items, accepting
or rejecting it as belonging to a fatigue
continuum. Meanwhile, lists of the
items were presented to 100 basic airmen
with the instructions to indicate any items
which they did not recognize or understand.
These two procedures reduced the number of
items to 92. The next problem, one of scaling,
was to determine where each of these
items belonged on the fatigue continuum.

1 The author is indebted to George E. Byars, Jr.,
for his assistance in the collection and scale analysis
of checklist data.

Thurstone scaling. Twelve qualified judges 
sorted the 92 items along a nine-interval con- 
tinuum according to Thurstone’s method of 
equal-appearing intervals. Interval 1 repre- 
sented extreme well-being; Interval 9 rep- 
resented extreme fatigue. Ambiguity (Q) 
values were computed, and indicated the re- 
jection of 48 items. Thus, with 44 items 
available for further analysis, the first step in 
the Edwards-Kilpatrick procedure had been 
completed. 
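The equal-appearing-intervals step can be illustrated as follows. In this sketch each item's scale value is taken as the median of the judges' interval placements and its ambiguity (Q) as the interquartile range. The ratings, the item labels, and the Q cutoff are all hypothetical, since the study does not report the rejection criterion, and Python's interpolated quartiles are a simplification of Thurstone's grouped-data computation.

```python
from statistics import median, quantiles


def scale_value_and_q(ratings):
    """Return (scale value, Q) for one item.

    Scale value: median interval assigned by the judges.
    Q: interquartile range of the placements, used as an ambiguity index.
    """
    q1, _, q3 = quantiles(ratings, n=4, method="inclusive")
    return median(ratings), q3 - q1


# Hypothetical placements by 12 judges on the nine-interval continuum
# (1 = extreme well-being, 9 = extreme fatigue).
judgments = {
    "hypothetical item X": [8, 9, 9, 8, 9, 9, 8, 9, 9, 9, 8, 9],
    "hypothetical item Y": [2, 7, 4, 9, 5, 3, 8, 6, 1, 5, 7, 4],
}

Q_CUTOFF = 2.0  # assumed threshold for illustration; not given in the study

for item, ratings in judgments.items():
    value, q = scale_value_and_q(ratings)
    verdict = "retain" if q <= Q_CUTOFF else "reject as ambiguous"
    print(f"{item}: scale value = {value}, Q = {q}, {verdict}")
```

An item on which the judges agree closely yields a small Q and is retained; one scattered across the continuum yields a large Q and is dropped.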

The experimental checklist. In developing 
the checklist format it was necessary to de- 
cide on a response system which would be 
appropriate to the requirements of scale 
analysis. The system chosen offered S a 
choice for each item of one of three response 
categories: better-than, same-as, or worse- 
than. One advantage of this system which is 
particularly appropriate in the measurement 
of fatigue is that S is forced to pinpoint him- 
self on the fatigue continuum. 

The experimental checklist was given the 
noncommittal name “Feeling-Tone Checklist.” 
Its 44 items were randomly ordered. Sepa- 
rate checklist instructions were developed for 
use in the experiment to be described. 


Developmental Study 


The developmental study was designed to 
provide data for item validity estimates and 
internal-consistency item analyses. To test 
the items for validity it was necessary to find 
a suitable criterion; that is, a situation had 
to be created which would definitely produce 
fatigue. Valid items would, of course, discriminate
significantly between a “nonfatigue”
situation and the “fatigue” situation.


Task. The apparatus chosen to produce the “fatigue”
situation was the USAF SAM Multidimensional
Pursuit Test (CM 813 E), fully described elsewhere
(4, 5). Test Ss are required to manipulate
throttle, stick, and rudder controls so as to compensate
for the apparently random movements of
four instrument pointers from their null positions.
When all four pointers are centered concurrently, a
timer cumulates an accuracy score in units of 0.01
minute.

Ss performing on the test apparatus not only
manifest task aversion both subjectively and objectively,
but also complain of tiredness in specific body
locations (1, 8). Decline in task proficiency (work
decrement) is evidenced after about an hour’s practice.

In the present study, two copies of the test apparatus
were used to test Ss in pairs. A common
cycling device metered out alternate work trials (1
min.) and rest periods (15 sec.) for any desired span
of time.

Subjects. The experimental sample consisted of 48 
volunteer, experimentally naive, basic airmen. Ss 
were judged to have had adequate rest and to be 
otherwise fit for the task. 

Procedure. At 9:00 a.m. each testing day Ss were
read the checklist instructions by a qualified examiner,
then they proceeded to fill out the experimental
form. Immediately following this Ss were
instructed in the operation of the test apparatus,
then they received 40 trials of initial learning (9:15-
10:05) to establish a substantial level of skill. This
was followed by a 10-min. rest interval during which
a motivational indoctrination was delivered and a
performance feedback device was described. The use
of these adjuncts, described elsewhere as “I,” (8)
and “M,” (9), respectively, it was argued, was necessary
if the checklist was to reflect fatigue only.

After E had delivered the feedback indoctrination,
Ss returned to the task for four hours (10:15-
2:15). At the conclusion of the test period, Ss again
filled out the experimental checklist.

Testing was conducted in an air-conditioned, well-
illuminated room. Ss could not observe one another’s
performance.


Results. Item validity was inferred from
chi-square tests of significance. An item was
accepted as valid when it could be shown to
discriminate significantly between fatigued
and nonfatigued criterion groups. The response-category
frequencies from the first
administration (A.M.) constituted the “nonfatigued”
(or to be more explicit, “less fatigued”)
criterion data; the response-category
frequencies from the second administration
(P.M.) constituted the “fatigued” criterion
data.

2 A 2-page table giving item validity and internal
consistency data and marginal frequencies for the 44
checklist items has been deposited with the American
Documentation Institute. Order Document No.
5190 from ADI Auxiliary Publications Project, Photoduplication
Service, Library of Congress, Washington
25, D. C., remitting in advance $1.25 for
microfilm, or $1.25 for photocopies. Make checks
payable to Chief, Photoduplication Service, Library
of Congress.

Internal consistency item analyses were
performed on both a.m. and p.m. data.
Checklists were scored using simple weights
as follows: “better-than” response, 2; “same-as”
response, 1; “worse-than” response, 0.
Each set of data was then divided into high-score
(N = 18) and low-score (N = 18) criterion
groups. Chi square was then used to
test the significance of difference between the
marginal frequencies of the two criterion
groups. The results revealed a definite trend:
Significant items for the a.m. data were predominantly
from the positive end of the continuum
(Intervals 1-4), while those for the
p.m. data were predominantly from the negative
end of the continuum (Intervals 6-9).
In other words, an item tends to be significant
when it falls within that part of the continuum
which seems to be “functioning” at
the moment.
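The scoring and grouping step can be sketched briefly. The weights (2, 1, 0) are those given in the text; the subject labels, the responses, and the helper names are hypothetical, and the handling of tied scores at the group boundaries is an assumption, since the study does not describe it.

```python
# Weights taken from the text: better-than = 2, same-as = 1, worse-than = 0.
WEIGHTS = {"better-than": 2, "same-as": 1, "worse-than": 0}


def score_checklist(responses):
    """Total score for one checklist: the sum of the item weights."""
    return sum(WEIGHTS[response] for response in responses)


def criterion_groups(scores, group_size):
    """Split subjects into the highest- and lowest-scoring groups.

    Ties keep their original order (an assumption made for illustration).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:group_size], ranked[-group_size:]


# Hypothetical responses of six subjects to a three-item checklist.
responses = {
    "S1": ["better-than", "better-than", "same-as"],
    "S2": ["worse-than", "worse-than", "same-as"],
    "S3": ["same-as", "same-as", "same-as"],
    "S4": ["better-than", "better-than", "better-than"],
    "S5": ["worse-than", "worse-than", "worse-than"],
    "S6": ["better-than", "same-as", "same-as"],
}

scores = {subject: score_checklist(r) for subject, r in responses.items()}
high, low = criterion_groups(scores, group_size=2)

print("scores:", scores)
print("high-score group:", high)
print("low-score group:", low)
```

In the study itself the groups held 18 Ss each out of 48; each item's response frequencies in the two groups would then be compared by chi square.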

Only a handful of items could be rejected
on the basis of poor validity and internal
consistency; therefore, the large number of
“good” items remaining offered the possibility
of constructing two checklists rather than
one. Since a typical Guttman scale consists
of 10 to 12 items, a shortage of good items did
not seem to be a problem. However, to find
pairs of “equivalent” items from those available
was a different matter. The chief criterion
used in pairing items was whether the
two items had similar response category frequencies.3
Items from Intervals 1 through 4
for the a.m. data and from Intervals 6 through
9 for the p.m. data with internal consistency
probability levels of greater than .05 were not
considered. Two items from Scale Interval 9
were included to “anchor” the checklists even
though not valid in terms of the data; it was
hypothesized that these items would prove
valid under more fatiguing conditions. At
this point it was not possible to select equivalent
items for Scale Intervals 3 and 7, and
therefore two items had to serve “double
duty” on both forms. No items were used
from the “neutral,” or middle, zone (Interval 5)
in accordance with Edwards and
Kilpatrick’s caution (3). Thirteen “pairs”
of equivalent items were ultimately decided
upon, and these comprised the equivalent-form
checklists which were designated as
Form A and Form B. Table 1 lists the checklist
items along with their descriptive statistics.
The items of course were randomly
ordered for use in the validation study to be
described.

3 Picking an equal number of items from each
Thurstone interval is unnecessary since the response
category frequencies also indicate an item’s place on
the continuum; Thurstone scaling is useful, however,
for screening a large number of items.

Table 1

Descriptive Statistics for the Checklist Equivalent-Forms

                                                      Marginal Frequencies                   Probability Level
                                               A.M. Data               P.M. Data               Internal Consistency
Item                              Form   Better-   Same-  Worse- Better-   Same-  Worse-     Item     A.M.     P.M.
                                            than      as    than    than      as    than Validity     Data     Data
 1. Like I’m bursting with energy A            1      15      32       0       1      47     .001     .001      .70
    I never felt fresher          B            2      15      31       0       2      46     .001     .001      .50
 2. Extremely peppy               A            2      19      27       0       2      46     .001     .001      .50
    Extremely lively              B            1      19      28       0       2      46     .001     .001      .70
 3. Very lively                   A            2      25      21       0       4      44     .001      .01      .10
    Very fresh                    B            2      25      21       0       2      46     .001     .001      .50
 4. Very refreshed                A            1      30      17       0       1      47     .001      .01      .70
    Very rested                   B            2      26      20       1       3      44     .001      .01      .20
 5. Quite fresh                   A,B          2      35      11       0       2      46     .001      .05      .50
 6. Somewhat fresh                A           12      36       0       0      10      38     .001        ?     .001
    Somewhat refreshed            B           13      35       0       0       4      44     .001        ?      .10
 7. Slightly tired                A           30      18       0       2      25      21     .001      .01     .001
    A little tired                B           30      18       0       3      28      17     .001      .01     .001
 8. Slightly pooped               A           33      15       0       3      32      13     .001      .05      .05
    A little pooped               B           39       9       0       4      32      12     .001      .30      .05
 9. Fairly well pooped            A,B         44       3       1       9      35       4     .001      .70     .001
10. Petered out                   A           47       1       0      17      30       1     .001     1.00     .001
    Awfully tired                 B           47       1       0      15      32       1        ?        ?     .001
11. Very tired                    A           48       0       0      22      25       1     .001     1.00     .001
    Tuckered out                  B           48       0       0      22      26       0     .001     1.00     .001
12. Extremely tired               A           47       1       0      33      13       2     .001      .98      .01
    Weary to the bone             B           47       1       0      30      14       4     .001      .98      .01
13. Ready to drop                 A           48       0       0      38       8       2      .50     1.00      .01
    Dead tired                    B           48       0       0      36      11       1        ?     1.00     .001


Validation Study

Method. Both the task and the source of Ss were
identical with those previously described. At 9:00
A.M. the experimental Ss were read the checklist instructions,
then they proceeded to fill out the Form
A checklist (this data will hereafter be referred to
as “1A”). Ss then received 40 trials of initial learning
(9:15-10:05) on the test apparatus after which
they filled out the Form B checklist (hereafter, Data
2B). This was followed by the motivational and
feedback indoctrinations previously described. Ss
then returned to their task for three hours (10:15-
1:15) at the conclusion of which they were given a
4-min. “rest period.” During this period Ss remained
seated at their apparatus and filled out Form
A of the checklist (Data 3A). Ss were then tested
for an additional half hour (1:19-1:49). At the conclusion
of the testing program Ss received first one
form of the checklist and then the other to fill out
(4A and 4B, in counterbalanced order).

Ss were tested in pairs until a sample of 100 was
obtained. Experimental conditions were otherwise
identical with those of the developmental study.

Concurrently with the testing of each pair of experimental
Ss, pairs of control Ss were also “tested”
in an adjacent room. These Ss received the same
schedule of checklists as did the experimental Ss. A
separate indoctrination given at the start of testing
(9:10) was judged to have been successful in keeping
these Ss alert during their 4½-hour test period.
When not engaged in filling out checklists, the control
Ss were allowed to read magazines, converse,
write letters, and smoke.


Results. The procedure just described pro- 
vided 10 sets of checklist data: 1A, 2B, 3A, 
4A, and 4B for both experimental and con- 
trol Ss. The analyses performed were as fol- 
lows: 

First of all, the checklists were scored using 
the simple 2, 1, 0 weights described above.
Product-moment correlations were then com- 
puted between S scores of Data 4A and 4B 
for both experimental and control groups. 
The resulting correlations, which are esti- 
mates of equivalent-form reliability, were .92 
for the experimental group and .95 for the 
control group. 
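The equivalent-form reliability estimate is an ordinary product-moment correlation between each S's Form A and Form B totals from the final administration. The sketch below shows the computation on hypothetical scores; the study's raw scores are not reported, so the data and the resulting coefficient are illustrative only.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical total scores for six Ss on the two forms
# (13 items scored 2, 1, 0 give a possible range of 0 to 26).
form_a_scores = [10, 14, 7, 12, 9, 15]
form_b_scores = [11, 13, 8, 12, 10, 16]

reliability = pearson_r(form_a_scores, form_b_scores)
print(f"equivalent-form reliability estimate: r = {reliability:.2f}")
```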

The determination of Form A checklist validity
was effected by a comparison of the
feeling-tone level of the experimental and
control groups at the first, third, and fourth
checklist administrations. A repeated-measurements
analysis was made within the split-plot
design of Groups × Administrations. Results
are shown in Table 2 where one should
note the significant A × G interaction which
points to a difference in slope for the group
mean curves. This may be interpreted as
meaning there is a greater decrease in feeling-tone
over time for the experimental as compared
with the control group. This finding is
all the more noteworthy when one considers
the fact that the feeling-tone level of the control
Ss showed a significant decline between
the first and third administrations as demonstrated
by a t test (t = 3.80; P = .001) between
the subclass means (see Table 3).
Thus the ability of this checklist to reflect
a significantly greater decline in feeling-tone
for the experimental group, when both experimental
and control groups became significantly
“tired” in terms of checklist data, is
more than adequate proof of its validity.

Table 2

Analysis of Variance of Form A Checklist Data

Source of Variance      Mean Square
Groups                     1,744.22
Error (a)                     27.77
Administrations            2,011.17
A × G                        386.40
Error (b)                      7.47

Table 3

Subclass Means for Form A Checklist Administrations

                      Administrations
Group           No. 1    No. 3    No. 4     Rows
Control           ?        ?      14.37    15.68
Experimental      ?        ?       9.92    12.27
Columns           ?        ?      12.14    13.98
The determination of Form B checklist va 
lidity was effected in a similar manner. Here 
the comparison was between the feeling-tone 
level of the experimental and control groups 
at the second and fourth administrations. 


Results of the repeated measurements analy- 
sis made within the split-plot design of Groups 
* Administration are shown in Table 4. 
again the significant A x G interaction points 
to a greater decrease in feeling-tone over time 
for the experimental as compared with the 


Once 


control group. Subclass means for the Form 
B data are shown in Table 5. The difference 


Table 4
Analysis of Variance of Form B Checklist Data

Source of Variance     df     Mean Square          F
Groups                  1        1,156.00      53.79
Error (a)             198           21.49
Administrations         1        1,398.76     200.10
A × G                   1           65.61       9.38
Error (b)                            6.99
Total


Table 5
Subclass Means for Form B Checklist Administrations

                    Administrations
Group           No. 2      No. 4       Rows
Control         17.08      14.15      15.61
Experimental    14.49       9.94      12.21
Columns         15.78      12.04      13.91


between the control and experimental groups
at the second administration is significant
(t = 3.98; P = .001); and, as was the case with
the Form A data, the feeling-tone level of the
control Ss showed a significant decline with
time (t = 4.51; P = .001, between data 2B
and 4B).
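
As a check on the split-plot reasoning, the F ratios printed in Table 4 can be recovered from its mean squares. The sketch below assumes the usual assignment of error terms for this layout: Groups is tested against Error (a), while Administrations and the A × G interaction are tested against Error (b). The function name and dictionary keys are illustrative only and do not come from the original report.

```python
# Mean squares as printed in Table 4 (Form B checklist data).
TABLE_4_MEAN_SQUARES = {
    "Groups": 1156.00,
    "Error (a)": 21.49,
    "Administrations": 1398.76,
    "A x G": 65.61,
    "Error (b)": 6.99,
}


def split_plot_f_ratios(ms):
    """Return F ratios for a two-factor split-plot (repeated measurements) design.

    Assumption: the between-subjects effect (Groups) is tested against
    Error (a); the within-subjects effects are tested against Error (b).
    """
    return {
        "Groups": ms["Groups"] / ms["Error (a)"],
        "Administrations": ms["Administrations"] / ms["Error (b)"],
        "A x G": ms["A x G"] / ms["Error (b)"],
    }


f_ratios = split_plot_f_ratios(TABLE_4_MEAN_SQUARES)

if __name__ == "__main__":
    for source, f_value in f_ratios.items():
        print(f"{source:<16} F = {f_value:.2f}")
```

The quotients agree with the tabled values of 53.79, 200.10, and 9.38 to within rounding in the second decimal place.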


Scale Analysis 


The procedure for scale analysis followed 
in this study was a modification of Guttman’s 
Cornell technique as described by Niven (7). 
The recent work of Menzel was further in- 
corporated (6). 

To begin with, a perfect scale was derived
on the basis of all the checklist data. Sepa- 
rate scale analyses were then made on Data 
2B, 3A, 4A, and 4B for both experimental 
and control groups plus one analysis on the 
1A data combined for both groups. The 
merging of the control and experimental 1A 
data was justified since the “experimental”
Ss at this point had yet to be subjected to
different experimental conditions than the 
controls. The individual papers within each 
of these nine sets of data, having already been 
scored for previous analyses, were then or- 
dered from high to low. A scatterplot was 
then made with item responses being tallied 
in columns headed by the item-response cate- 
gories and on the same line in which the indi- 
vidual’s total score was recorded. An error 
was recorded whenever a response fell out- 
side the perfect scale pattern. Guttman’s 
coefficient of reproducibility was then deter- 
mined by computing the percentage of con- 
sistent responses. Menzel’s coefficient of scal- 
ability was next obtained. Results of this 
scale analysis (first approximation) are shown 
in Table 6. A coefficient of reproducibility 
of .90 is considered acceptable. Although no 
specific level of acceptance is presently recog- 
nized for the coefficient of scalability, Menzel 
suggests that it may be somewhere between 
.60 and .65. 
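
The two coefficients described above can be sketched for the simplest case. The example below is not the authors' procedure: it assumes dichotomous (0/1) items rather than the multi-category responses used in the checklists, counts an error as any deviation from the ideal pattern implied by a respondent's total score, and takes Menzel's maximum errors as the smaller of the non-modal response counts summed by items or by individuals. The function name and the sample data are hypothetical.

```python
def scale_coefficients(responses):
    """Return (reproducibility, scalability) for a table of 0/1 responses.

    responses: list of rows, one per respondent, each a list of 0/1 values.
    Errors are deviations from the ideal pattern for the respondent's total
    score, with items ordered from most to least frequently endorsed.
    """
    n_people = len(responses)
    n_items = len(responses[0])

    # Number of respondents endorsing each item.
    endorsed = [sum(row[j] for row in responses) for j in range(n_items)]
    # Items ordered from most to least popular.
    order = sorted(range(n_items), key=lambda j: -endorsed[j])

    errors = 0
    for row in responses:
        score = sum(row)
        ideal = [0] * n_items
        for j in order[:score]:  # a perfect scale type endorses the top `score` items
            ideal[j] = 1
        errors += sum(a != b for a, b in zip(row, ideal))

    total_responses = n_people * n_items
    reproducibility = 1 - errors / total_responses

    # Maximum errors: responses falling outside the modal category,
    # counted once by items and once by individuals; use the smaller sum.
    max_by_items = sum(n_people - max(e, n_people - e) for e in endorsed)
    max_by_people = sum(
        n_items - max(sum(row), n_items - sum(row)) for row in responses
    )
    max_errors = min(max_by_items, max_by_people)
    scalability = 1 - errors / max_errors if max_errors else 1.0

    return reproducibility, scalability


# Hypothetical data: four perfect scale types plus one deviant respondent.
sample = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 1],
]
rep, scal = scale_coefficients(sample)
```

Because reproducibility divides errors by all responses while scalability divides them by the errors that were possible, the second coefficient is always the more severe of the two, which is consistent with the pattern of values in Table 6.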

With respect to individual items, no item
had a reproducibility coefficient low enough
to consider dropping it from the checklist.
Two pairs of equivalent items, however, were
noted to have item-response frequencies similar
to other items and therefore were performing
superfluous jobs. By eliminating these
items it was possible to maintain equivalency
while reducing the number of items in the
checklists to 11 each. To obtain reproducibility
and scalability coefficients for the 11-item
checklist “forms,” the data were rescored
and then evaluated by the procedures
of scale analysis already described. Inspection
of the results indicated that one pair of
equivalent items was responsible for a considerable
proportion of the errors of reproducibility.
Consequently, this pair was eliminated.
The data were then rescored on the
basis of the 10 remaining items and reevaluated
by scale analysis. Results of both
second and third approximations are shown in
Table 6.


Table 6
Results of Scale Analyses

                     Reproducibility (%)            Scalability (%)
                     Approximation                  Approximation
Data*        N       First    Second   Third        First    Second   Third
1A          200      91.38    91.09    93.20        65.43    64.23    70.37
3A-Expl     100      89.54    89.45    90.80        61.80    61.20    63.05
4A-Expl     100      89.62    89.45    90.40        57.81    56.39    58.08
3A-Cont     100      90.00    89.45    90.40        64.38    64.42    66.55
4A-Cont     100      91.00    90.55    92.40        62.01    63.25    70.31
2B-Expl     100      87.85    86.09    89.40        55.99    52.04    63.82
4B-Expl     100      89.92    89.18    90.00        57.33    54.75    55.56
2B-Cont     100      90.31    89.91    91.00        60.87    60.50    61.37
4B-Cont     100      89.08    88.45    89.60        55.76    57.09    61.34

* See text for code.

Discussion 


That the checklists are equivalent and that 
their items, both individually and collectively, 
are valid has been demonstrated. It is, how- 
ever, somewhat difficult to give an unequivo- 
cal “Yes” or “No” answer to the question of 
whether the checklists are unidimensional. 
This is a problem commonly faced by an 
investigator employing scale analysis since 
many of the criteria are subjective, and it is 
the investigator alone who must provide the 
answer. The coefficients of reproducibility
exceed or closely approach the .90 acceptance
level. Not taken into account by Guttman’s
coefficient is human error—that is, obviously 
misplaced checkmarks, of which there was a 
considerable amount for the population used 
in this study. On the other hand, the extreme 
response frequencies of items such as “I never 
felt fresher” and “dead tired” result in what 
is termed artificial reproducibility. Yet, it is 
argued that in the case of constructing a fa- 
tigue checklist, items from both ends of the 
continuum are required. Under nonfatigue 
conditions “extremely peppy” is functioning 
at its best, whereas “ready to drop” is not 
functioning at all; yet, under extreme fatigue 
conditions the reverse is true. The item-re- 
sponse frequencies for each checklist, obtained 
under the various conditions described, ade- 
quately cover the continuum, and should, in 
toto, reflect the subjective state for any fa- 
tigue-research situation conceivable. 


Summary and Conclusions 


Two 13-item equivalent-form fatigue checklists
were developed by the scale discrimination
method. The items, individually and
collectively, were valid, with checklist reli- 
ability being on the order of .90. Both sets 
of items, further, constituted a unidimensional 
scale according to the criteria of Guttman 
scale analysis. 

Final evaluation of the checklists, of course, 
must wait until their usefulness can be dem- 
onstrated in industrial studies. Checklist data, 
for example, may be of value in scheduling 
rest periods, in determining optimal hours of 
work, and in studying the effects of changes 
in working conditions and equipment design. 
A fatigue checklist may further have some 
“therapeutic” value in the sense of giving the 
labor force a chance to air its grievances. On 
the other hand, one should be cautioned not 
to think of subjective fatigue as a predictor 
or correlate of work output since such relationships
have not been the subject of definitive
study, and consequently are little understood.


Faking on the Rosenzweig Picture-Frustration Study ¹


A. B. Silverstein ²


New York University 


The literature contains abundant evidence 
of the susceptibility of personality inventories 
to faking. Ellis (4) states that in 22 out of 
25 studies dealing with this problem, the sub- 
jects were able to achieve significantly better 
or worse scores when they were motivated to 
do so. In the case of projective techniques, 
much less research of this sort has been pub- 
lished, but the available evidence suggests that 
projective techniques, too, can be faked. Fos- 
berg (5, 6) interpreted his data to mean that 
the Rorschach withstood all attempts at ma- 
nipulation by the subjects, but his studies 
have been severely criticized by Cronbach 
(2) because of flaws in statistical method. 
Carp and Shavzin (1) attempted to verify 
Fosberg’s results, but discovered that some 
subjects could manipulate their responses, and 
so were able to vary their personality pictures 
as reflected by the Rorschach. Similarly, in 
a study of the TAT, Weisskopf and Dieppa 
(11) found that the subjects were able to 
influence the diagnoses of their personalities 
made by experienced interpreters. Again, 
Meltzoff (7) reports that his subjects were 
able to alter their responses on a sentence
completion test so as to create the impression
of either good or poor adjustment. The present
investigation was concerned with faking
on still another projective technique, the
Rosenzweig Picture-Frustration Study (8).
The purpose was to determine the nature and
extent of changes in performance on the P-F
Study when the subjects attempt to make a
good or a bad impression of their personalities.

¹ Based on a master’s thesis for which T. N. Jenkins
served as research adviser.

² Now at the Psychiatric Institute, University of
Maryland School of Medicine.


Subjects and Procedure 


Forty-two male college students served as subjects 
for the investigation. They ranged in age from 17 
to 34 (Mean = 22), and more than half of them were 
in their sophomore year. 

The P-F Study was administered by the group 
method, under three sets of instructions: the stand- 
ard instructions that appear on the test booklet; in- 
structions for the subjects to “make the very best 
impression” of their personalities; and instructions 
for them to “make the very worst impression” of 
their personalities. All subjects took the standard
form of the test first. Immediately afterward, half
of them took the “best impression” form and half
the “worst impression” form. On the third adminis- 
tration of the test, they took the remaining form. 


Results 


The test records were scored using the revised
scoring manual (10). Table 1 shows the
means and standard deviations for the major
scoring categories under the three sets of
instructions. In Table 2, scores under standard
instructions are compared with scores under
“best impression” and “worst impression”
instructions. These comparisons are based on
the statistical sign test (3), which does not
depend on assumptions of normality or of
homogeneity of variance.


Table 1
Picture-Frustration Study Percentage Scores Under Three Sets of Instructions
(N = 42)

             Standard       “Best Impression”     “Worst Impression”
Score       Mean    SD        Mean    SD             Mean    SD


Table 2
Comparison of Scores Under Three Sets of Instructions
(N = 42)

Standard vs. “Best Impression”; Standard vs. “Worst Impression”

Score        CR
E          5.93**
I          4.16**
M          4.09**
O-D        0.83
E-D        2.06*
N-P        3.52**

 * Significant at .05 level.
** Significant at .001 level.

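
The sign test named above needs only the direction of each subject's change between two sets of instructions, which is why it makes no assumption about normality or variance. The sketch below is illustrative: it uses the usual normal approximation with a continuity correction for the critical ratio, which is an assumption, since the report does not say how its CRs were computed. The function name and the scores are hypothetical.

```python
import math


def sign_test(first, second):
    """Sign test on paired scores.

    Returns (plus, minus, critical_ratio, exact_two_sided_p).
    Tied pairs are dropped, as is conventional for this test.
    """
    diffs = [b - a for a, b in zip(first, second) if b != a]
    n = len(diffs)
    plus = sum(d > 0 for d in diffs)
    minus = n - plus

    # Normal approximation with continuity correction:
    # CR = (|plus - minus| - 1) / sqrt(n), floored at zero.
    cr = max(0.0, (abs(plus - minus) - 1) / math.sqrt(n)) if n else 0.0

    # Exact two-sided probability from the binomial with p = 1/2.
    k = min(plus, minus)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    p_exact = min(1.0, 2 * tail)

    return plus, minus, cr, p_exact


# Hypothetical percentage scores for ten subjects under two instructions.
standard = [10, 12, 9, 14, 11, 13, 8, 15, 10, 12]
faked = [12, 15, 9, 16, 10, 17, 11, 18, 13, 14]
plus, minus, cr, p_exact = sign_test(standard, faked)
```

With 42 subjects, as in this study, the normal approximation and the exact binomial probability would be expected to agree closely.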

It is obvious that there were marked 
changes in almost all of the scoring categories 
under both “best impression” and “worst im- 
pression” instructions. With just one excep- 
tion (O-D), these changes were in opposite 
directions. In attempting to make a good im- 
pression, the subjects emphasized the solution 
of the frustrating problem and either sought 
to gloss over the frustration or else directed
their aggression inward upon themselves. To
make a bad impression, on the other hand, 
they turned their aggression onto the environ- 
ment and placed greater emphasis on fixing 
blame for the frustration. 


Discussion 


Two trends can be observed in the data. 
The first is that the scoring categories which 
deal with type of reaction (O-D, E-D, and 
N-P) were somewhat more resistant to ma- 
nipulation than were those representing direc- 
tion of aggression (E, I, and M). In this 
connection, Rosenzweig (9) has pointed to 
the more implicit and less inferrable nature 
of the former categories, and offered the hy- 
pothesis that direction of aggression may be 
less “projective” than type of reaction.

The second trend which appears in the data
is that when the subjects attempted to make 
a bad impression their scores changed more 
than when they attempted to make a good 
impression. A similar finding is reported by 
Weisskopf and Dieppa (11) in their study of 
the TAT. These authors draw the tentative 
conclusion that subjects taking the TAT may 
be attempting to make a good impression even 
when the test is administered using the stand- 
ard instructions. In this case, however, the 
attempt to make a good impression may not 
be altogether deliberate. 

Finally, certain limitations of the present 
study should be noted: (a) While the data 
show the extreme limits within which it is 
possible to alter scores on the P-F Study, they 
obviously do not indicate the amount of faking
ordinarily to be expected on the test. (b)
The study provides no information on whether 
the faked test records could be identified by 
an experienced examiner, although from the 
magnitude of the changes observed, it appears 
likely that in many cases it would be possible 
to detect faking. Despite these limitations, 
however, the extreme ease with which most 
of the test scores were altered suggests that 
the P-F Study be used with considerable caution
in situations where there exists strong
motivation for the subject to make either a 
good or a bad impression of his personality. 


Summary and Conclusions 


The Rosenzweig Picture-Frustration Study 
was administered to 42 male college students 
by the group method, under three sets of instructions:
standard instructions, instructions
to “make the very best impression” of their
personalities, and instructions to “make the
very worst impression” of their personalities. 
The principal conclusions are as follows: 

1. P-F Study scores are subject to consid- 
erable faking, both in the direction of making 
a good impression and in the direction of mak- 
ing a bad impression. 

2. Scoring categories which deal with direc- 
tion of aggression are somewhat more sus- 
ceptible to manipulation than those concerned 
with type of reaction.

3. When the subjects attempt to make a
good impression, their scores change less than
when they attempt to make a bad impression.

The results of this investigation indicate the
need for caution in the use of the P-F Study
with subjects who may be highly motivated to
make either a good or a bad impression.


References

1. Carp, A. L., & Shavzin, A. R. The susceptibility to falsification of the Rorschach Psychodiagnostic Technique. J. consult. Psychol., 1950, 14, 230-233.

2. Cronbach, L. J. Statistical methods applied to Rorschach scores: a review. Psychol. Bull., 1949, 46, 393-429.

3. Dixon, W. J., & Mood, A. M. The statistical sign test. J. Amer. statist. Ass., 1946, 41, 557-566.

4. Ellis, A. Recent research with personality inventories. J. consult. Psychol., 1953, 17, 45-49.

5. Fosberg, I. A. Rorschach reactions under varied instructions. Rorschach Res. Exch., 1938, 3.

6. Fosberg, I. A. An experimental study of the reliability of the Rorschach Psychodiagnostic Technique. Rorschach Res. Exch., 1941, 5, 72-84.

7. Meltzoff, J. The effect of mental set and item structure upon response to a projective test. J. abnorm. soc. Psychol., 1951, 46, 177-189.

8. Rosenzweig, S. The picture-association method and its application in a study of reactions to frustration. J. Pers., 1945, 14, 3-23.

9. Rosenzweig, S. Some problems relating to research on the Rosenzweig Picture-Frustration Study. J. Pers., 1950, 18, 303-305.

10. Rosenzweig, S., Fleming, Edith E., & Clarke, Helen J. Revised scoring manual for the Rosenzweig Picture-Frustration Study. J. Psychol., 1947, 24, 165-208.

11. Weisskopf, Edith A., & Dieppa, J. J. Experimentally induced faking of TAT responses. J. consult. Psychol., 1951, 15, 469-474.


Received September 24, 1956.





Journal of Applied Psychology 
Vol. 41, No. 3, 1957 


Prediction of Psychiatric Aide Performance ¹


Carlos A. Cuadra 
The RAND Corporation, Santa Monica, California 


and Charles F. Reed 


Princeton University 


It is a commonplace among workers in 
mental hospitals to observe that the work of 
the psychiatric aide is a crucial factor in the 
patient’s recovery from his illness. The rea- 
son for this observation is that the physician 
nominally responsible for the treatment of the 
patient usually has relatively limited contact 
with him, while the aide may spend a good 
part of his 8-hour day interacting with the 
patient in a variety of circumstances. Thus 
insofar as such contact is an important vari- 
able in social recovery, the aide is in a posi- 
tion to advance or hinder a patient’s progress, 
depending on the texture of the patient-aide 
relationship. 

In spite of the recognition of the psychiatric 
aide as a keystone of the therapeutic attack 
on mental disorder, there is a dearth of reli- 
able information on the nature of the good 
aide-patient relationship and on the person- 
ality attributes which have proved, or which 
might be considered, desirable in this group 
of hospital employees. The authors are aware 
of only one published study in this area (4), 
and its results are vitiated by unclear or ques- 
tionable methodology, including administra- 
tion of tests to employees already on the job 
and lack of cross validation of test “signs” 
which seem to have been derived from the 
original sample. 

The present study was concerned with the 
selection of psychiatric aides with the assist- 
ance of a psychological inventory reflecting a 
variety of personal beliefs, values, and social 
attitudes. It was undertaken as an effort to 
combat a problem of heavy employee turnover
in a large Veterans Administration hospital.
Of a total of 215 aides hired during
one 20-month period, 114 left the hospital
either without completing their training program
or within one month of its completion.

¹ This study was conducted while the authors were
staff psychologists at the Veterans Administration
Hospital, Downey, Illinois.


Procedure 


The instrument chosen for this study was the Cali- 
fornia Psychological Inventory (3), a 472-item in- 
ventory for which a number of psychologically im- 
portant scales have been developed. Scoring keys 
are available for such factors as Responsibility, Tolerance,
Flexibility, Dominance, Impulsivity, and
others of less immediately apparent relevance to
psychiatric aide performance. In addition to the
inventory items’ face-relevance to applicants for the
job, the test is easily administered and objectively
scored.

The CPI was administered to each applicant for 
the position of psychiatric aide at this hospital over 
a period of two and a half years. The test was given 
at the time of application and before hiring since, as 
Ghiselli and Brown have emphasized (2), test vali- 
dation studies must be conducted on a group of 
testees representative of those on whom the test 
eventually will be used. Testing of on-the-job em- 
ployees or applicants who have already been assured 
they will be hired is likely to violate this important 
condition in indeterminable ways. The test scores 
of applicants were not made available to the person- 
nel hiring department nor were they used in any way 
in the selection or placement of the successful applicants.

Three hundred and sixty-six applicants were tested 
in all. Of these, 332, 88 of them female, were hired 
and assigned to the required aide training program 
consisting of 5 weeks of classroom work followed by 
21 weeks of supervised ward duties.

Two criteria were selected by the investigators:
length of service and ward performance. While certainly
somewhat related, each of these variables was
felt to be of some significance in its own right.
Length of service was easily determinable from personnel
office records, while the adequacy of ward
performance was determined from a succession of
ratings by classroom instructors and ward supervisors
over a period of six months. With these judgments
immediately before her, the head of the psychiatric
aide division of the hospital assigned each
employee an over-all rating of excellent, good, fair,
poor, or very poor. These ratings served as criteria
of ward performance.*

From the first 200 protocols available for analysis, 
25 were selected on the basis that the employee re- 
mained on the job less than three months. These 
employees as a group were considered to be lacking
in basic motivation for the work of an aide. In
order to insure that such was indeed the case, em- 
ployees whose records showed that early termination 
was enforced rather than voluntary were excluded 
from this group. 

These 25 protocols were then compared with 25 
protocols of employees who had remained on the job 
more than ten months. After transferring the proto- 
cols to Item Record Cards (1), an item analysis was 
carried out and those items with promising differ- 
entiating power were retained to form a motivation 
scale. 

For the ward-performance variable, the item analy- 
sis contrasted protocols of 25 employees rated “ex- 
cellent” or “good” and 25 employees rated “poor” 
or “very poor.” The group of more proficient em- 
ployees was virtually identical with the long-tenure 
group used for the motivation item analysis, since 
excellent workers are almost necessarily well moti- 
vated and in addition receive recognition and finan- 
cial rewards which promote longer tenure. There 
was relatively little overlap, however, between the
less proficient workers and the short-tenure group, 
in part because dismissed workers were excluded by 
the investigators from the short-tenure group and in 
part because some of them remained so short a time 
that no evaluation of ward performance was possible. 

A total of 43 CPI items emerged from the item 
analysis with some promise of differentiating short-tenure
from long-tenure aides. For the ward-performance
variable, 47 items reached the comparable
level of significance (.10). The two scales had 10 
items in common. 

It is customary, in reporting the development of 
some new measure, to indicate its efficiency on the 
sample of cases on which it was originally derived.
Adherence to this procedure would show the motivation
scale to be very promising, since it was able to
separate long- and short-term employees with a CR
of 7.1. The scale also differentiated good- and poor-performance
aides, with a CR of 6.3. The ward-performance
scale also separated the criterion groups
well, doing better at separating good- and poor-performance
aides (CR = 9.6) than long- and short-term
aides (CR = 4.2). Use of an appropriate cutting
score on the motivation scale eliminated all
short-tenure aides, with only two false positives.
The performance scale was able to eliminate all poor-performance
aides, with five false positives, the N in
both cases being 60.

* We wish to express our deep appreciation to Mrs.
Mary W. Heaney, R.N., who made all of the criterion
ratings over a period of two years. Without
her capable assistance, this research could not have
been done.

Differentiation between original criterion groups is
a necessary but by no means sufficient condition to
establish the merit of any scale. It is obvious that 
the method of construction used (selection of differ- 
entiating items) will make it inevitable that any such 
scale will separate the criterion groups. However, 
the degree of overlap is not predictable in advance, 
being a function not only of the discriminating power 
of the individual items but also of their interrela- 
tionships. The really critical test of a scale, how- 
ever, must be its power in an entirely new sample 
of cases. After scoring keys were constructed, new 
protocols not used in the original item analysis were 
scored for both variables. The scores were then used 
as a basis for predicting (a) tenure and (b) ward 
performance. 
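
The point that item selection makes separation of the original criterion groups inevitable can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reanalysis of the study: every answer is generated at random with probability one-half, so no item has any real relation to the criterion. The group size (25), the item count (472), and the .10 level are taken from the text; all function names are hypothetical. By chance alone, 472 items examined at roughly the .10 level would be expected to yield on the order of 47 "significant" items, which is close to the 43 and 47 items the authors retained.

```python
import math
import random


def critical_ratio(a, b):
    """Difference between two group means divided by its standard error."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (mean_a - mean_b) / se if se > 0 else 0.0


def random_protocols(rng, n_people, n_items):
    """True/false answer sheets with no relation to any criterion."""
    return [[rng.random() < 0.5 for _ in range(n_items)] for _ in range(n_people)]


def select_items(high, low, z_cut=1.645):
    """Keep items whose endorsement rates differ at about the .10 level.

    Returns (item index, keyed answer) pairs; the keyed answer is the one
    given more often by the `high` criterion group.
    """
    n_h, n_l = len(high), len(low)
    keyed = []
    for j in range(len(high[0])):
        p_h = sum(row[j] for row in high) / n_h
        p_l = sum(row[j] for row in low) / n_l
        p = (p_h * n_h + p_l * n_l) / (n_h + n_l)  # pooled proportion
        se = math.sqrt(p * (1 - p) * (1 / n_h + 1 / n_l))
        if se > 0 and abs(p_h - p_l) / se >= z_cut:
            keyed.append((j, p_h > p_l))
    return keyed


def score(protocol, keyed):
    """Number of keyed items answered in the keyed direction."""
    return sum(protocol[j] == key for j, key in keyed)


def simulate(seed=0, n_items=472, n_per_group=25):
    """Return (items kept, CR on derivation sample, CR on a fresh sample)."""
    rng = random.Random(seed)
    high = random_protocols(rng, n_per_group, n_items)
    low = random_protocols(rng, n_per_group, n_items)
    keyed = select_items(high, low)
    cr_derivation = critical_ratio(
        [score(p, keyed) for p in high], [score(p, keyed) for p in low]
    )
    new_high = random_protocols(rng, n_per_group, n_items)
    new_low = random_protocols(rng, n_per_group, n_items)
    cr_cross = critical_ratio(
        [score(p, keyed) for p in new_high], [score(p, keyed) for p in new_low]
    )
    return len(keyed), cr_derivation, cr_cross


if __name__ == "__main__":
    kept, cr_derivation, cr_cross = simulate()
    print(f"items kept: {kept}")
    print(f"CR, derivation sample: {cr_derivation:.1f}")
    print(f"CR, cross-validation sample: {cr_cross:.1f}")
```

Even with purely random answers, the derivation-sample critical ratio comes out large, while the cross-validation ratio varies around zero. A large CR on the original sample therefore says little until the scale has been tried on new cases.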
Results 


No relationship whatever was found be- 
tween predictions and actual job tenure and 
performance in the cross-validation sample. 
Distributions of motivation and ward-per- 
formance scores were virtually identical for 
excellent, good, fair, and poor aides. 

A second item analysis was undertaken on 
a new group of good-versus-poor-performance 
aides and produced a new group of “signifi- 
cant” items only one of which had appeared 
in the original item analysis. Of 27 “good 
performance” items significant at the .05 level
in the original subsample item analyzed, 14
were scored in the same direction in the new
sample and 11 were scored in the opposite direction,
with 2 items showing no difference in
the proportion of true to false answers. 


Discussion 

Reasons for the complete collapse of two 
carefully derived scales in the process of cross- 
validation are not difficult to imagine, al- 
though substantiating them is an entirely dif- 
ferent matter. 

It may barely be possible that the job ten- 
ure and ward performance of psychiatric aides 
are not related in any consistent fashion to 
personality variables, or if they are, that the 
relationship is so weak as to be overshadowed 
by intellectual, experiential, or other variables 
not directly measured in our study. Such an 
hypothesis is one which few chief nurses, 
charge aides, or personnel officers would ac- 
cept, however, and there is as well no evidence 
for it. 

There is, secondly, the possibility that the 
CPI is not an appropriate instrument with
which to assess the particular personality variables
which may be involved here. While it
includes measures of a wide variety of psy- 
chological dimensions, it may not allow suffi- 
cient room for behavior pathology to express 
itself. One of its seeming virtues—namely, 
its relative freedom from the frankly patho- 
logical content of its chief predecessor, the 
Minnesota Multiphasic Personality Inventory 
—may in this situation be a limitation. Some 
support for this belief comes from an unpub- 
lished study by the authors in which it was 
discovered that the mean CPI scores for a 
group of 30 psychotic hospitalized patients 
were not significantly different from the means 
of normal persons. 

A third, and perhaps the greatest, source of 
difficulty has to do with choosing adequate 
criteria of performance against which to de- 
velop effective measuring devices. We are 
not referring only to our use of a single over- 
all rating of performance, about which there 
could conceivably be some objection, but also 
to the more basic problem of whether there is 
such a thing as a “good aide.” Is there, we 
may ask, a single constellation of personal 
attributes which promises effective perform- 
ance on a psychiatric ward? 

The wards of a large psychiatric hospital 
present a wide variety of experiences for and 
demands on an aide. On some, effective ‘ad- 
ministration” of sedentary chronic patients 
may overshadow any need for acute observa- 
tion, rapport-building, or persuasive talents, 
such as are usually felt invaluable on an acute 
treatment ward. Thus an aide rated “excellent”
in the former capacity might fail completely
in (for him) a more demanding
position; conversely, some aides who are ex- 
cellent workers in an active setting with 
younger, more disturbed patients, may lose 
interest and effectiveness in a situation which 
offers no real challenge to their abilities. 

It may be necessary, in attempting to pro- 
vide an effective screening device for aide se- 
lection, to specify part criteria such as effec- 
tiveness as a therapy assistant, efficiency as 
an administrator, apparent understanding of 
patient needs, etc. The criteria may be those 
imposed as the role of the aide shifts from 
custodial to therapeutic emphasis. The present
effort was based upon the criteria presently
in effect in judging work performance.
The unsuitability of a promising device for 
prediction of those criteria is an empirical 
fact. Closer specification of criteria of aide 
performance seems indicated in future efforts 
in aide selection.


In a recent paper (2) Dingman and Guilford
have considered and proposed a solution to 
the problem of forming a composite rating 
when the correlation matrix between raters is 
of unit rank. In the particular situation with 
which Dingman and Guilford were concerned, 
four supervisors rated 716 psychiatric technicians
as to their effectiveness. Upon analysis
a single common factor was found sufficient
to account for the matrix of correlations be- 
tween raters. From the four sets of ratings 
Dingman and Guilford sought to form a single 
composite rating ² which would be as valid and
as reliable as possible, where the precise mean- 
ing of “validity” remained to be determined. 
In the end they proposed that each rater’s 
contribution be weighted for his loading on 
the single common factor. To be more spe- 
cific, let x,; be the rating of rater g on ratee j 
and suppose that the ratings of each rater 
have been standardized. Then, the custom- 
ary, unit-weight composite for ratee 7 becomes 


x; - v1 { { Xmj- 


Dingman and Guilford proposed that instead 
of x we use the factor-weight composite 


xj’ = Myxyy + hexe) + +++ + Ines, 


where hy, ha, «++, hm represent the factor load- 
ings of the m raters. The purpose of this 
paper is to press the logic underlying the 
Dingman-Guilford proposal somewhat further. 
The effect will be to arrive at a somewhat
different solution to the problem of composite
ratings in the case of unit rank.

¹ Opinions or conclusions contained in this paper are
those of the author. They are not to be construed as
necessarily reflecting the view or the endorsement of
the Navy Department.

² Actually, Dingman and Guilford considered a more
complicated situation than the one with which we shall
deal. The complication consisted in their having the
raters give for each rating of effectiveness a rating of
the assurance with which they made the effectiveness
rating. These ratings of assurance were included in
their proposed composite together with the ratings of
effectiveness. In this paper we consider composites of
the basic rating only. The considerations, however,
which we shall raise concerning the simpler composite
are easily extended to more complicated composites
and with much the same effect.

Underlying the Dingman-Guilford argument 
is the identification of the single common factor 
F with the trait under study. This identifica- 
tion is tantamount to accepting F as a validity 
criterion. To the writer this convention seems 
quite plausible. However, the purpose of this 
paper is neither to defend nor attack the Ding- 
man-Guilford identification but to trace out 
its consequences. Consider, therefore, the 
customary composite x and the Dingman- 
Guilford composite x’. Traditionally, there 
are two grounds on which we might prefer x 
to x’ or vice versa: validity and reliability. 
We will take up validity first. 

Under the convention adopted, validity is 
defined by correlation with F. Though Ding- 
man and Guilford do not make the point, the 
validity of x’ is necessarily greater than or 
equal to the validity of x. Nevertheless, 
weighting the raters for their factor loadings
does not maximize the correlation with $F$.
The maximal weights ³ are $h_g / (1 - h_g^2)$. The composite

\[ x_j'' = \frac{h_1}{1 - h_1^2}\, x_{1j} + \cdots + \frac{h_m}{1 - h_m^2}\, x_{mj} \]

correlates maximally with $F$. In the rest of
this paper, as in this discussion, we suppose 
that the interrater matrix is of exactly unit 
rank. In effect, the observed correlations are 
replaced by the correlations reproduced from 
the factor pattern. The factor estimates ob- 
tained by this method are just as accurate as 
those obtained by the complete estimation 
method using observed correlations (4, p. 265). 
More importantly perhaps, by using the repro- 
duced correlations we do not bias the argument 
either for or against the three composites we 
are considering: x, x’, and x”. The conclusions
we reach are general.
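
The maximizing weights $h_i/(1 - h_i^2)$ can also be checked directly. What follows is only a sketch, under the assumption (not written out in this form in the text) that each standardized rating has the single-factor form $x_{ij} = h_i F_j + \sqrt{1 - h_i^2}\,u_{ij}$, where the unique parts $u_i$ have unit variance and are uncorrelated with F and with one another. The weights $w_i$, the composite $c$, and the abbreviation $T$ are notation introduced here for illustration. For a composite $c_j = \sum_i w_i x_{ij}$,

$$r_{cF}^2 = \frac{\left(\sum_i w_i h_i\right)^2}{\left(\sum_i w_i h_i\right)^2 + \sum_i w_i^2 (1 - h_i^2)}$$

which increases with the ratio $\left(\sum_i w_i h_i\right)^2 / \sum_i w_i^2 (1 - h_i^2)$. By the Cauchy-Schwarz inequality,

$$\left(\sum_i w_i h_i\right)^2 = \left(\sum_i w_i \sqrt{1 - h_i^2} \cdot \frac{h_i}{\sqrt{1 - h_i^2}}\right)^2 \le \sum_i w_i^2 (1 - h_i^2) \cdot \sum_i \frac{h_i^2}{1 - h_i^2}$$

with equality exactly when $w_i$ is proportional to $h_i/(1 - h_i^2)$. Writing $T = \sum_i h_i^2/(1 - h_i^2)$, the largest attainable validity under these assumptions is therefore

$$r_{x''F} = \sqrt{\frac{T}{1 + T}}$$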


5 To obtain these weights we solved a general expression
for the estimation of common factors (4, p. 279)
for the special case of unit rank.

Conceivably, the increase in validity obtained
by using x” instead of x’ might not
justify the increase in computational difficulty. 
However, as is clear from the definition of x”,
its calculation is only a trifle more difficult
than the calculation of x’. Insofar as validity
is concerned, therefore, the same considerations
which lead us to prefer x’ to x lead us to
prefer x” to x’.
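
The ordering of the three validities can be illustrated numerically. The sketch below is not taken from the paper: it assumes standardized ratings and an exactly unit-rank model, and the function name `validity` and the three loadings are hypothetical values chosen only for illustration.

```python
import math


def validity(weights, loadings):
    """Correlation between a weighted composite of raters and the common factor F.

    Assumes standardized ratings and an exactly unit-rank (single-factor) model,
    so that each rater's unique variance is 1 - h**2.
    """
    # Covariance of the composite with F.
    common = sum(w * h for w, h in zip(weights, loadings))
    # Variance contributed by the raters' unique parts.
    unique = sum(w * w * (1.0 - h * h) for w, h in zip(weights, loadings))
    return common / math.sqrt(common * common + unique)


# Hypothetical factor loadings for three raters (illustrative values only).
loadings = [0.8, 0.6, 0.4]

unit_weights = [1.0] * len(loadings)                     # customary composite x
loading_weights = list(loadings)                         # Dingman-Guilford composite x'
maximal_weights = [h / (1.0 - h * h) for h in loadings]  # composite x''

r_x = validity(unit_weights, loadings)
r_x1 = validity(loading_weights, loadings)
r_x2 = validity(maximal_weights, loadings)

print(f"validity of x   = {r_x:.4f}")
print(f"validity of x'  = {r_x1:.4f}")
print(f"validity of x'' = {r_x2:.4f}")
```

For these particular loadings the validities come out at roughly 0.80, 0.83, and 0.85, so the gain from x’ to x” is smaller than the gain from x to x’, while the extra computation is a single division per rater.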

An analysis of hypothetical self-correlation 
(1), which is reliability as we use the term, 
does not issue in so neat a conclusion. The 
reason is that hypothetical self-correlations, 
whether of individual raters or of composites, 
cannot be determined from a single administra- 
tion. The data which Dingman and Guilford 
considered may be regarded as the result of 
administering an m-item test once to a large 
population of subjects. As Guttman (3) has 
pointed out, from a single administration of a 
test we cannot determine hypothetical self- 
correlations, either of individual items, i.e., 
raters, or of composites of raters. The best 
we can do is to calculate lower bounds to the 
reliability coefficient; and the greatest of these
bounds is what Cronbach (1) has called the
“coefficient of equivalence.” Cronbach defines
this coefficient as “the degree to which
the test score indicates the status of the indi- 
vidual at the present instant in the general 
and group factors defined by the test.” In 
the case at issue “the general and group factors 
defined by the test” reduce to the single common
factor F. In consequence, “the degree
to which the test score indicates the status of
the individual” on F becomes $r_{xF}$, $r_{x'F}$, or
$r_{x''F}$ depending upon whether the “test score”
is x, x’, or x”. As we have already seen, however,
$r_{x''F}$ is greater than either of the other
two coefficients. Therefore, though we cannot
assert that the reliability of x” is greater
than the reliability of x’, we can assert that
the greatest lower bound to the reliability of
x”, its coefficient of equivalence, is greater
than the greatest lower bound to the reliability
of x’. Insofar as we are able to judge, x” is
to be preferred over x’ for reasons of reliability
as well as of validity.

The approach we have taken to the reli- 
ability of x’ is not quite the same as the ap- 
proach adopted by Dingman and Guilford. 
Recognizing that the reliability of their composite
could not be estimated from the data
at hand, Dingman and Guilford considered 
what they called “inter-rater consistency or 
inter-composite consistency.” To illustrate
their meaning let a, b, ..., t and α, β, ..., θ
be two mutually exclusive but not necessarily 
exhaustive sets of raters. Define 


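$$y_j = x_{aj} + x_{bj} + \cdots + x_{tj}$$

$$y'_j = h_a x_{aj} + h_b x_{bj} + \cdots + h_t x_{tj}$$

$$y''_j = \frac{h_a}{1 - h_a^2}\,x_{aj} + \frac{h_b}{1 - h_b^2}\,x_{bj} + \cdots + \frac{h_t}{1 - h_t^2}\,x_{tj}$$

Similarly, define

$$z_j = x_{\alpha j} + x_{\beta j} + \cdots + x_{\theta j}$$

$$z'_j = h_\alpha x_{\alpha j} + h_\beta x_{\beta j} + \cdots + h_\theta x_{\theta j}$$

$$z''_j = \frac{h_\alpha}{1 - h_\alpha^2}\,x_{\alpha j} + \frac{h_\beta}{1 - h_\beta^2}\,x_{\beta j} + \cdots + \frac{h_\theta}{1 - h_\theta^2}\,x_{\theta j}$$
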
The correlations $r_{yz}$, $r_{y'z'}$, and $r_{y''z''}$ may be
taken as measures of the intercomposite consistency
of x, x’, and x”, respectively. The
two sets of raters, a, ..., t and α, ..., θ, are
not, of course, the only sets of raters into
which the set of all the raters may be divided;
and to every pair of sets there corresponds a
trio of coefficients: $r_{yz}$, $r_{y'z'}$, and $r_{y''z''}$. Thus,
intercomposite consistency does not refer to 
any single coefficient but to as many coeffi- 
cients as there are intercomposites which may 
be formed from the total composite. Actually, 
Dingman and Guilford considered all inter- 
composites, three in number, in which both 
sets of raters included two and only two raters. 
In a subsample of 50 psychiatric technicians 
they calculated the coefficients $r_{y'z'}$ from x’
and compared them with the corresponding
coefficients $r_{yz}$ from x. In every instance, the
coefficient from x’ was substantially greater
than its counterpart from x. The coefficients
from x”, however, are easily shown to be
greater than or, in special cases, equal to the
coefficients from x and x’. The relations

$$r_{yz} = r_{yF}\,r_{zF}$$

$$r_{y'z'} = r_{y'F}\,r_{z'F}$$

$$r_{y''z''} = r_{y''F}\,r_{z''F}$$

are easily established by algebraically expanding
the correlation coefficients involved. However,
we know that $r_{y''F}$ and $r_{z''F}$ represent
the validity coefficients corresponding to the
most valid composites which can be formed
from raters a, b, ..., t and α, β, ..., θ, respectively.
Therefore, $r_{y''F} \ge r_{y'F}$ or $r_{yF}$ and
$r_{z''F} \ge r_{z'F}$ or $r_{zF}$; or, more pertinently,
$r_{y''z''} \ge r_{y'z'}$ or $r_{yz}$. In summary, then, a
consideration of intercomposite consistency 
leads to the same conclusion as did the consideration
of validity and reliability: x” is to
be preferred over x’ for precisely the same
reasons that x’ is to be preferred over x.
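
The intercomposite argument can be checked numerically as well. The sketch below is an illustration under stated assumptions, not the authors' computation: it builds the correlations reproduced from a unit-rank factor pattern, splits four raters into two pairs, and compares each intercomposite correlation with the product of the two half-composite validities. The loadings, the split, and all function names are hypothetical.

```python
import math


def reproduced_corr(loadings):
    """Correlation matrix reproduced from a single-factor pattern: h_i * h_j off the diagonal."""
    n = len(loadings)
    return [[1.0 if i == j else loadings[i] * loadings[j] for j in range(n)]
            for i in range(n)]


def cov(corr, idx_a, w_a, idx_b, w_b):
    """Covariance of two weighted composites of standardized ratings."""
    return sum(wa * wb * corr[i][j]
               for i, wa in zip(idx_a, w_a)
               for j, wb in zip(idx_b, w_b))


def intercomposite_r(corr, idx_y, w_y, idx_z, w_z):
    """Correlation between composites formed from two disjoint sets of raters."""
    return cov(corr, idx_y, w_y, idx_z, w_z) / math.sqrt(
        cov(corr, idx_y, w_y, idx_y, w_y) * cov(corr, idx_z, w_z, idx_z, w_z))


def validity(loadings, idx, w):
    """Correlation of a weighted composite with the common factor F."""
    common = sum(wi * loadings[i] for i, wi in zip(idx, w))
    unique = sum(wi * wi * (1.0 - loadings[i] ** 2) for i, wi in zip(idx, w))
    return common / math.sqrt(common * common + unique)


# Hypothetical loadings for four raters, split into two pairs (illustrative only).
loadings = [0.8, 0.7, 0.5, 0.4]
set_y, set_z = [0, 1], [2, 3]
corr = reproduced_corr(loadings)

# Weighting schemes for the composites x, x', and x''.
schemes = {
    "x": lambda h: 1.0,
    "x'": lambda h: h,
    "x''": lambda h: h / (1.0 - h * h),
}

results = {}
for name, weight in schemes.items():
    w_y = [weight(loadings[i]) for i in set_y]
    w_z = [weight(loadings[i]) for i in set_z]
    r = intercomposite_r(corr, set_y, w_y, set_z, w_z)
    product = validity(loadings, set_y, w_y) * validity(loadings, set_z, w_z)
    results[name] = (r, product)
    print(f"{name:4s} intercomposite r = {r:.4f}, product of validities = {product:.4f}")
```

In each row the two printed values agree, which is the product relation stated above, and for this split the intercomposite correlation rises from x to x’ to x”. Other loadings or other splits can be tried by editing `loadings`, `set_y`, and `set_z`.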

The purpose of this paper has been to 
examine the consequences of the Dingman- 
Guilford approach to the formation of a com- 
posite rating in the case of unit rank. The 
essence of the Dingman-Guilford approach 
is the identification of the single common 
factor with the trait being rated. A first
consequence of this identification is that the
Dingman-Guilford composite x’ is superior to
the customary unit-weight composite from both
a validity and a reliability point of view. Ex- 
tending the argument, however, assigning the 
raters those weights which maximize the cor- 
relation with the single common factor yields 
a composite which is superior to x’ on the 
same counts and for the same reasons that x’ 
is superior to the customary composite. 


Received October 5, 1956. 